.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4 KiB are supported up to a
maximum of 1 MiB (default block size 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list (kernel code): [email protected]
Web site: github.com/plougher/squashfs-tools

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  =========  ==========
                                Squashfs   Cramfs
==============================  =========  ==========
Max filesystem size             2^64       256 MiB
Max file size                   ~ 2 TiB    16 MiB
Max files                       unlimited  unlimited
Max directories                 unlimited  unlimited
Max entries per directory       unlimited  unlimited
Max block size                  1 MiB      4 KiB
Metadata compression            yes        no
Directory indexes               yes        no
Sparse file support             yes        no
Tail-end packing (fragments)    yes        no
Exportable (NFS etc.)           yes        no
Hard link support               yes        no
"." and ".." in readdir         yes        no
Real inode numbers              yes        no
32-bit uids/gids                yes        no
File creation time              yes        no
Xattr support                   yes        no
ACL support                     no         no
==============================  =========  ==========

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
are very likely packaged by your Linux distribution (as squashfs-tools).
The source code can be obtained from github.com/plougher/squashfs-tools.
Usage instructions can also be obtained from this site.

2.1 Mount options
-----------------

===================    =========================================================
errors=%s              Specify whether squashfs errors trigger a kernel panic
                       or not

                       ==========  =============================================
                       continue    errors don't trigger a panic (default)
                       panic       trigger a panic when errors are encountered,
                                   similar to several other filesystems (e.g.
                                   btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)

                                   This allows a kernel dump to be saved,
                                   useful for analyzing and debugging the
                                   corruption.
                       ==========  =============================================
threads=%s             Select the decompression mode or the number of threads

                       If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:

                       ==========  =============================================
                       single      use single-threaded decompression (default)

                                   Only one block (data or metadata) can be
                                   decompressed at any one time. This limits
                                   CPU and memory usage to a minimum, but it
                                   also gives poor performance on parallel I/O
                                   workloads when using multiple CPU machines
                                   due to waiting on decompressor availability.
                       multi       use up to two parallel decompressors per core

                                   If you have a parallel I/O workload and your
                                   system has enough memory, using this option
                                   may improve overall I/O performance. It
                                   dynamically allocates decompressors on a
                                   demand basis.
                       percpu      use a maximum of one decompressor per core

                                   It uses percpu variables to ensure
                                   decompression is load-balanced across the
                                   cores.
                       1|2|3|...   configure the number of threads used for
                                   decompression

                                   The upper limit is num_online_cpus() * 2.
                       ==========  =============================================

                       If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
                       SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
                       both set:

                       ==========  =============================================
                       2|3|...     configure the number of threads used for
                                   decompression

                                   The upper limit is num_online_cpus() * 2.
                       ==========  =============================================

===================    =========================================================

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |     table     |
        |---------------|
        |    export     |
        |     table     |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size). If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8 KiB blocks.
Each compressed block is prefixed by a two-byte length; the top bit is set if
the block is uncompressed. A block will be uncompressed if the -noI option is
set, or if the compressed block was larger than the uncompressed block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode may span two compressed blocks. Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and therefore, can share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block. A new directory header
is written whenever the inode start block changes. The directory
header/directory entry list is repeated as many times as necessary.
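
The two-level header/entry organisation described above can be illustrated
with a small sketch. It is not the on-disk format or the kernel code: the
names ``DirEntry``, ``DirHeader`` and ``group_entries`` are hypothetical, the
block and offset values are made up, and any further limits the real format
places on a header are ignored. It only shows how entries whose inodes live in
the same metadata block end up under one shared header.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List, Tuple


    @dataclass
    class DirEntry:
        """One directory entry: a name plus its inode's offset in the block."""
        name: str
        inode_offset: int


    @dataclass
    class DirHeader:
        """A header holding the inode start block shared by its entries."""
        start_block: int
        entries: List[DirEntry] = field(default_factory=list)


    def group_entries(listing: List[Tuple[str, int, int]]) -> List[DirHeader]:
        """Group (name, inode_block, inode_offset) tuples into header runs.

        Entries are sorted by name; a new header is started whenever the
        inode start block differs from that of the previous entry.
        """
        headers: List[DirHeader] = []
        for name, block, offset in sorted(listing):
            if not headers or headers[-1].start_block != block:
                headers.append(DirHeader(start_block=block))
            headers[-1].entries.append(DirEntry(name, offset))
        return headers


    # Hypothetical directory: two inodes share block 0, one lives elsewhere.
    listing = [
        ("b.txt", 0, 96),
        ("a.txt", 0, 32),
        ("c.txt", 8192, 0),
    ]
    headers = group_entries(listing)

With this input the sketch produces two headers: one for start block 0 holding
``a.txt`` and ``b.txt``, and a second for ``c.txt``, whose inode sits in a
different metadata block.
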

Directories are sorted, and can contain a directory index to speed up
file lookup. Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block. Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up. At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and because
it is small), this second index table is read at mount time and cached in
memory.

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).
This allows large values to be stored out of line, improving scanning and
lookup performance. It also allows values to be de-duplicated: the value is
stored once, and all other occurrences hold an out-of-line reference to that
value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data Squashfs uses two small metadata and fragment caches.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way. The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression) the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.
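
As a rough illustration of why temporarily caching decompressed metadata and
fragment blocks pays off, the sketch below keeps a handful of decompressed
blocks keyed by their on-disk location. It is not the kernel implementation:
the ``BlockCache`` class, the ``read_block`` callback, the slot count and the
least-recently-used eviction policy are all assumptions made for the example,
and zlib stands in for whichever compressor the filesystem uses.

.. code-block:: python

    import zlib
    from collections import OrderedDict


    class BlockCache:
        """Tiny cache of decompressed blocks, keyed by on-disk location."""

        def __init__(self, read_block, slots=8):
            # Hypothetical callback: location -> compressed bytes.
            self._read_block = read_block
            self._slots = slots
            self._entries = OrderedDict()  # location -> decompressed bytes
            self.decompress_count = 0

        def get(self, location):
            if location in self._entries:
                # Cache hit: no read or decompress needed.
                self._entries.move_to_end(location)
                return self._entries[location]
            data = zlib.decompress(self._read_block(location))
            self.decompress_count += 1
            self._entries[location] = data
            if len(self._entries) > self._slots:
                # Assumed policy: evict the least recently used block.
                self._entries.popitem(last=False)
            return data


    # A fake "disk" holding two compressed metadata blocks.
    disk = {
        0: zlib.compress(b"inode A" + b"inode B"),
        100: zlib.compress(b"dir X"),
    }
    cache = BlockCache(disk.__getitem__, slots=2)

    first = cache.get(0)   # read from "disk" and decompressed
    second = cache.get(0)  # served from the cache

The second lookup returns the already decompressed block, so an access to
"inode B" shortly after "inode A" costs no further decompression.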