- SQUASHFS 4.0 FILESYSTEM
- =======================
- Squashfs is a compressed read-only filesystem for Linux.
- It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
- directories. Inodes in the system are very small and all blocks are packed to
- minimise data overhead. Block sizes greater than 4 KiB are supported, up to a
- maximum of 1 MiB (default block size 128 KiB).
- Squashfs is intended for general read-only filesystem use, for archival
- use (i.e. in cases where a .tar.gz file may be used), and in constrained
- block device/memory systems (e.g. embedded systems) where low overhead is
- needed.
- Mailing list: squashfs-devel@lists.sourceforge.net
- Web site: www.squashfs.org
- 1. FILESYSTEM FEATURES
- ----------------------
- Squashfs filesystem features versus Cramfs:
-                                 Squashfs        Cramfs
- Max filesystem size:            2^64            256 MiB
- Max file size:                  ~ 2 TiB         16 MiB
- Max files:                      unlimited       unlimited
- Max directories:                unlimited       unlimited
- Max entries per directory:      unlimited       unlimited
- Max block size:                 1 MiB           4 KiB
- Metadata compression:           yes             no
- Directory indexes:              yes             no
- Sparse file support:            yes             no
- Tail-end packing (fragments):   yes             no
- Exportable (NFS etc.):          yes             no
- Hard link support:              yes             no
- "." and ".." in readdir:        yes             no
- Real inode numbers:             yes             no
- 32-bit uids/gids:               yes             no
- File creation time:             yes             no
- Xattr support:                  yes             no
- ACL support:                    no              no
- Squashfs compresses data, inodes and directories. In addition, inode and
- directory data are highly compacted, and packed on byte boundaries. Each
- compressed inode is on average 8 bytes in length (the exact length varies with
- the file type: regular file, directory, symbolic link, and block/char device
- inodes have different sizes).
- 2. USING SQUASHFS
- -----------------
- As squashfs is a read-only filesystem, the mksquashfs program must be used to
- create populated squashfs filesystems. This and other squashfs utilities,
- together with usage instructions, can be obtained from http://www.squashfs.org.
- The squashfs-tools development tree is now located on kernel.org:
- git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
- 3. SQUASHFS FILESYSTEM DESIGN
- -----------------------------
- A squashfs filesystem consists of up to nine parts, packed together on byte
- boundaries:
-  ---------------
- |  superblock   |
- |---------------|
- |  compression  |
- |    options    |
- |---------------|
- |  datablocks   |
- |  & fragments  |
- |---------------|
- |  inode table  |
- |---------------|
- |   directory   |
- |     table     |
- |---------------|
- |   fragment    |
- |     table     |
- |---------------|
- |    export     |
- |     table     |
- |---------------|
- |    uid/gid    |
- | lookup table  |
- |---------------|
- |     xattr     |
- |     table     |
-  ---------------
- Compressed data blocks are written to the filesystem as files are read from
- the source directory, and checked for duplicates. Once all file data has been
- written, the completed inode, directory, fragment, export, uid/gid lookup and
- xattr tables are written.
- 3.1 Compression options
- -----------------------
- Compressors can optionally support compression-specific options (e.g.
- dictionary size). If non-default compression options have been used, then
- these are stored here.
- 3.2 Inodes
- ----------
- Metadata (inodes and directories) is compressed in 8 KiB blocks. Each
- compressed block is prefixed by a two-byte length; the top bit is set if the
- block is stored uncompressed. A block is stored uncompressed if the -noI option
- is set, or if the compressed block would be larger than the uncompressed block.
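The two-byte prefix described above can be decoded with a few lines of Python. This is only a sketch: the function and constant names are illustrative, and it assumes the length word is stored little-endian, which the text above does not state.

```python
import struct

UNCOMPRESSED_FLAG = 0x8000  # top bit of the 16-bit length word
LENGTH_MASK = 0x7FFF        # remaining 15 bits hold the stored length


def parse_metadata_header(raw: bytes) -> tuple[int, bool]:
    """Return (stored_length, is_compressed) for a metadata block prefix.

    Assumption: the two-byte length is little-endian.
    """
    if len(raw) < 2:
        raise ValueError("a metadata block prefix is two bytes long")
    (word,) = struct.unpack_from("<H", raw)
    stored_length = word & LENGTH_MASK
    is_compressed = not (word & UNCOMPRESSED_FLAG)
    return stored_length, is_compressed
```

A reader would use the returned length to fetch the block body, and would skip the decompressor when the flag reports an uncompressed block.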
- Inodes are packed into the metadata blocks and are not aligned to block
- boundaries, so an inode may span two compressed blocks. Inodes are identified
- by a 48-bit number which encodes the location of the compressed metadata block
- containing the inode, and the byte offset into that block where the inode is
- placed (<block, offset>).
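As an illustration of the <block, offset> encoding, the sketch below packs and unpacks a 48-bit inode reference. The split into a 32-bit block location and a 16-bit offset is an assumption made for this example (the text above only gives the 48-bit total), and the function names are hypothetical.

```python
BLOCK_BITS = 32   # assumed width of the metadata block location
OFFSET_BITS = 16  # assumed width of the offset into the decompressed block


def pack_inode_ref(block: int, offset: int) -> int:
    """Combine a block location and a byte offset into one 48-bit number."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block location out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (block << OFFSET_BITS) | offset


def unpack_inode_ref(ref: int) -> tuple[int, int]:
    """Split a 48-bit inode reference back into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```

Because the offset refers to the decompressed 8 KiB block, 16 bits are more than enough to address any byte within it.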
- To maximise compression there are different inodes for each file type
- (regular file, directory, device, etc.), the inode contents and length
- varying with the type.
- To further maximise compression, two types of regular file inode and
- directory inode are defined: inodes optimised for frequently occurring
- regular files and directories, and extended types where extra
- information has to be stored.
- 3.3 Directories
- ---------------
- Like inodes, directories are packed into compressed metadata blocks, stored
- in a directory table. Directories are accessed using the start address of
- the metablock containing the directory and the offset into the
- decompressed block (<block, offset>).
- Directories are organised in a slightly complex way, and are not simply
- a list of file names. The organisation takes advantage of the
- fact that (in most cases) the inodes of the files will be in the same
- compressed metadata block, and can therefore share the start block.
- Directories are therefore organised as a two-level list: a directory
- header containing the shared start block value, followed by a sequence of
- directory entries, each of which uses that start block. A new directory header
- is written whenever the inode start block changes. The directory
- header/directory entry list is repeated as many times as necessary.
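The two-level header/entry organisation can be sketched as a grouping of sorted entries by inode start block. This models only the grouping rule described above; the `Entry` type and `build_directory` function are hypothetical, and any other limits the on-disk format may place on a header are not modelled.

```python
from itertools import groupby
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    inode_block: int   # start of the metadata block holding the inode
    inode_offset: int  # offset of the inode inside that block


def build_directory(entries):
    """Return a list of (start_block, [(name, offset), ...]) runs.

    Entries are sorted by name, then a new header is started each time
    the inode start block differs from that of the previous entry.
    """
    ordered = sorted(entries, key=lambda e: e.name)
    runs = []
    for block, run in groupby(ordered, key=lambda e: e.inode_block):
        runs.append((block, [(e.name, e.inode_offset) for e in run]))
    return runs
```

Since `groupby` only merges consecutive items, a start block that reappears later in the sorted order gets a fresh header, matching the "repeated as many times as necessary" rule.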
- Directories are sorted in alphabetical order, and can contain a directory index
- to speed up file lookup. The index stores one entry per metablock, each entry
- mapping the filename of the first directory header in that metadata block to
- its location. At lookup the index is scanned linearly for the first filename
- alphabetically larger than the filename being looked up; the entry before it
- identifies the metadata block that contains the filename.
- The general idea of the index is to ensure only one metadata block needs to be
- decompressed to do a lookup irrespective of the length of the directory.
- This scheme has the advantage that it doesn't require extra memory overhead
- and doesn't require much extra storage on disk.
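The linear index scan can be sketched as follows. The index is modelled as a list of (first filename, block location) pairs, which is a simplification for illustration; `find_start_block` and `directory_start` are hypothetical names, and Python string comparison stands in for whatever ordering the real implementation uses.

```python
def find_start_block(index, name, directory_start):
    """Pick the metadata block in which to start searching for `name`.

    `index` is a list of (first_filename, block_location) pairs in
    sorted order. Scanning stops at the first index filename that is
    larger than `name`; the location recorded just before that point
    is where the directory scan should begin.
    """
    location = directory_start
    for first_filename, block_location in index:
        if first_filename > name:
            break
        location = block_location
    return location
```

Only the block returned here needs to be decompressed, whatever the length of the directory.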
- 3.4 File data
- -------------
- Regular files consist of a sequence of contiguous compressed blocks, and/or a
- compressed fragment block (tail-end packed block). The compressed size
- of each datablock is stored in a block list contained within the
- file inode.
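Because the block list stores only sizes, the on-disk location of block n has to be found by summing the sizes of the blocks before it. The sketch below shows that walk. It assumes each list entry is a 32-bit word whose bit 24 flags an uncompressed block (a detail not given in the text above), and the function name is hypothetical.

```python
UNCOMPRESSED_BLOCK_FLAG = 1 << 24  # assumed flag bit in a block list entry


def datablock_location(start_block, block_list, n):
    """Return (disk_offset, stored_size) of the n-th datablock of a file.

    `start_block` is where the file's first datablock begins and
    `block_list` holds one size word per datablock.
    """
    if not 0 <= n < len(block_list):
        raise IndexError("block index outside the file")
    offset = start_block
    for word in block_list[:n]:
        offset += word & ~UNCOMPRESSED_BLOCK_FLAG  # skip earlier blocks
    return offset, block_list[n] & ~UNCOMPRESSED_BLOCK_FLAG
```

The cost of this walk grows with n, which is why a cache of previously computed positions helps for large files.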
- To speed up access to datablocks when reading 'large' files (256 MiB or
- larger), the code implements an index cache that caches the mapping from
- block index to datablock location on disk.
- The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
- retaining a simple and space-efficient block list on disk. The cache
- is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
- Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
- The index cache is designed to be memory efficient, and by default uses
- 16 KiB.
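The figures quoted for the index cache are consistent with each other, as the short calculation below shows. The per-slot memory figure assumes the 16 KiB is divided evenly between the eight slots, which the text does not state.

```python
KIB, GIB, TIB = 1 << 10, 1 << 30, 1 << 40

SLOTS = 8                    # number of index cache slots
PER_SLOT_FILE = 224 * GIB    # file size one slot can cover
BLOCK_SIZE = 128 * KIB       # block size the figures are quoted for
CACHE_MEMORY = 16 * KIB      # default memory used by the cache

# Largest file covered when all slots are used by a single file.
max_file_bytes = SLOTS * PER_SLOT_FILE

# Number of datablocks whose positions one slot has to span.
blocks_per_slot = PER_SLOT_FILE // BLOCK_SIZE

# Memory per slot, assuming an even split of the cache.
memory_per_slot = CACHE_MEMORY // SLOTS

print(max_file_bytes / TIB, blocks_per_slot, memory_per_slot)
```

Eight slots of 224 GiB give 1792 GiB, which is the 1.75 TiB limit mentioned above.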
- 3.5 Fragment lookup table
- -------------------------
- Regular files can contain a fragment index which is mapped to a fragment
- location on disk and compressed size using a fragment lookup table. This
- fragment lookup table is itself stored compressed into metadata blocks.
- A second index table is used to locate these. For speed of access (and
- because it is small), this second index table is read at mount time and
- cached in memory.
- 3.6 Uid/gid lookup table
- ------------------------
- For space efficiency regular files store uid and gid indexes, which are
- converted to 32-bit uids/gids using an id lookup table. This table is
- stored compressed into metadata blocks. A second index table is used to
- locate these. For speed of access (and because it is small), this second
- index table is read at mount time and cached in memory.
- 3.7 Export table
- ----------------
- To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
- can optionally contain an inode number to inode disk location lookup table
- (this can be disabled with the -no-exports mksquashfs option). The table is
- required to enable Squashfs to map inode numbers passed in filehandles to the
- inode location on disk, which is necessary when the export code reinstantiates
- expired/flushed inodes.
- This table is stored compressed into metadata blocks. A second index table is
- used to locate these. For speed of access (and because it is small), this
- second index table is read at mount time and cached in memory.
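The fragment, uid/gid and export tables in sections 3.5 to 3.7 share the same two-level scheme: an in-memory index table points at metadata blocks, and each metadata block holds fixed-size entries. The sketch below shows how an entry number could be resolved. The entry sizes in `ENTRY_SIZES` are assumptions made for this example, and the function names are hypothetical.

```python
METADATA_BLOCK_SIZE = 8192  # uncompressed size of a metadata block

# Assumed entry sizes in bytes; they are not given in the text above.
ENTRY_SIZES = {"fragment": 16, "id": 4, "export": 8}


def locate_entry(table, index):
    """Return (index_table_slot, byte_offset) for entry number `index`."""
    byte_position = index * ENTRY_SIZES[table]
    return (byte_position // METADATA_BLOCK_SIZE,
            byte_position % METADATA_BLOCK_SIZE)


def lookup_location(index_table, table, index):
    """Resolve an entry to (metadata_block_location, offset_in_block).

    `index_table` models the second index table that is read at mount
    time: one on-disk metadata block location per slot.
    """
    slot, offset = locate_entry(table, index)
    return index_table[slot], offset
```

With the in-memory index table available, any entry can be reached by reading and decompressing a single metadata block.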
- 3.8 Xattr table
- ---------------
- The xattr table contains extended attributes for each inode. The xattrs
- for each inode are stored in a list, each list entry containing a type,
- name and value field. The type field encodes the xattr prefix
- ("user.", "trusted." etc.) and also encodes how the name/value fields
- should be interpreted. Currently the type indicates whether the value
- is stored inline (in which case the value field contains the xattr value),
- or out of line (in which case the value field stores a reference to where
- the actual value is stored). Storing large values out of line improves
- scanning and lookup performance. It also allows values to be de-duplicated:
- the value is stored once, and all other occurrences hold an out of line
- reference to that value.
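A decoder for the type field might look like the sketch below. The numeric prefix ids, the mask and the out-of-line flag value are assumptions chosen for illustration and are not taken from the text above; `decode_xattr_type` is a hypothetical name.

```python
PREFIX_MASK = 0x00FF       # assumed: low byte selects the xattr prefix
VALUE_OUT_OF_LINE = 0x100  # assumed: flag marking an out of line value

# Assumed prefix numbering.
PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}


def decode_xattr_type(type_field):
    """Return (prefix, value_is_out_of_line) for an xattr type field."""
    prefix_id = type_field & PREFIX_MASK
    if prefix_id not in PREFIXES:
        raise ValueError(f"unknown xattr prefix id {prefix_id}")
    return PREFIXES[prefix_id], bool(type_field & VALUE_OUT_OF_LINE)
```

When the second element is true, the value field would be read as a reference to the stored value rather than as the value itself.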
- The xattr lists are packed into compressed 8K metadata blocks.
- To reduce overhead in inodes, rather than storing the on-disk
- location of the xattr list inside each inode, a 32-bit xattr id
- is stored. This xattr id is mapped into the location of the xattr
- list using a second xattr id lookup table.
- 4. TODOS AND OUTSTANDING ISSUES
- -------------------------------
- 4.1 Todo list
- -------------
- Implement ACL support.
- 4.2 Squashfs internal cache
- ---------------------------
- Blocks in Squashfs are compressed. To avoid repeatedly decompressing
- recently accessed data, Squashfs uses two small caches, one for metadata and
- one for fragments.
- These caches are not used for file datablocks, which are decompressed and
- cached in the page-cache in the normal way. They temporarily hold fragment
- and metadata blocks which have been read as a result of a metadata
- (i.e. inode or directory) or fragment access. Because metadata and fragments
- are packed together into blocks (to gain greater compression), the read of a
- particular piece of metadata or fragment will retrieve other metadata/fragments
- which have been packed with it. Because of locality of reference, these may be
- read in the near future. Temporarily caching them ensures they are available
- for near future access without requiring an additional read and decompress.
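The benefit of caching whole decompressed blocks can be shown with a toy model. This is not how the Squashfs cache is implemented: least-recently-used eviction, zlib and the `BlockCache` class are all choices made for this sketch.

```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Toy cache of decompressed blocks, keyed by on-disk location.

    LRU eviction is an assumption made for this sketch.
    """

    def __init__(self, read_block, capacity=8):
        self._read_block = read_block  # callable: location -> compressed bytes
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.misses = 0                # number of read-and-decompress operations

    def get(self, location):
        if location in self._blocks:
            self._blocks.move_to_end(location)  # mark as recently used
            return self._blocks[location]
        self.misses += 1
        data = zlib.decompress(self._read_block(location))
        self._blocks[location] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict least recently used
        return data
```

Two items packed into the same block cost one decompression between them, as long as the block stays in the cache between the two accesses.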
- In the future this internal cache may be replaced with an implementation which
- uses the kernel page cache. Because the page cache operates on page sized
- units this may introduce additional complexity in terms of locking and
- associated race conditions.