.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead.  Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (the default block size is 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list (kernel code): [email protected]
Web site: github.com/plougher/squashfs-tools

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  =========               ==========
                                Squashfs                Cramfs
==============================  =========               ==========
Max filesystem size             2^64                    256 MiB
Max file size                   ~ 2 TiB                 16 MiB
Max files                       unlimited               unlimited
Max directories                 unlimited               unlimited
Max entries per directory       unlimited               unlimited
Max block size                  1 MiB                   4 KiB
Metadata compression            yes                     no
Directory indexes               yes                     no
Sparse file support             yes                     no
Tail-end packing (fragments)    yes                     no
Exportable (NFS etc.)           yes                     no
Hard link support               yes                     no
"." and ".." in readdir         yes                     no
Real inode numbers              yes                     no
32-bit uids/gids                yes                     no
File creation time              yes                     no
Xattr support                   yes                     no
ACL support                     no                      no
==============================  =========               ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
the file type: regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

Because Squashfs is a read-only filesystem, the mksquashfs program must be
used to create populated Squashfs filesystems.  This and other Squashfs
utilities are very likely packaged by your Linux distribution (as
squashfs-tools).  The source code can be obtained from
github.com/plougher/squashfs-tools.  Usage instructions can also be obtained
from this site.

2.1 Mount options
-----------------

===================    =========================================================
errors=%s              Specify whether squashfs errors trigger a kernel panic
                       or not

                       ==========  =============================================
                         continue  errors don't trigger a panic (default)
                            panic  trigger a panic when errors are encountered,
                                   similar to several other filesystems (e.g.
                                   btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)

                                   This allows a kernel dump to be saved,
                                   useful for analyzing and debugging the
                                   corruption.
                       ==========  =============================================
threads=%s             Select the decompression mode or the number of threads

                       If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:

                       ==========  =============================================
                           single  use single-threaded decompression (default)

                                   Only one block (data or metadata) can be
                                   decompressed at any one time. This limits
                                   CPU and memory usage to a minimum, but it
                                   also gives poor performance on parallel I/O
                                   workloads when using multiple CPU machines
                                   due to waiting on decompressor availability.
                            multi  use up to two parallel decompressors per core

                                   If you have a parallel I/O workload and your
                                   system has enough memory, using this option
                                   may improve overall I/O performance. It
                                   dynamically allocates decompressors on a
                                   demand basis.
                           percpu  use a maximum of one decompressor per core

                                   It uses percpu variables to ensure
                                   decompression is load-balanced across the
                                   cores.
                        1|2|3|...  configure the number of threads used for
                                   decompression

                                   The upper limit is num_online_cpus() * 2.
                       ==========  =============================================

                       If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
                       SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
                       both set:

                       ==========  =============================================
                          2|3|...  configure the number of threads used for
                                   decompression

                                   The upper limit is num_online_cpus() * 2.
                       ==========  =============================================

===================    =========================================================
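
The handling of threads=%s values can be modelled in user space.  The sketch
below is illustrative only and is not the kernel parser: parse_threads_opt and
enum decomp_mode are hypothetical names, online_cpus stands in for
num_online_cpus(), and rejecting an out-of-range count with -EINVAL (rather
than clamping it) is an assumption.  It covers the case where
SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:

.. code-block:: c

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    enum decomp_mode { DECOMP_SINGLE, DECOMP_MULTI, DECOMP_PERCPU, DECOMP_COUNT };

    /*
     * Hypothetical model of threads=%s.  online_cpus stands in for
     * num_online_cpus().  Returns 0 on success or -EINVAL on a bad value.
     */
    static int parse_threads_opt(const char *val, unsigned long online_cpus,
                                 enum decomp_mode *mode,
                                 unsigned long *max_threads)
    {
        unsigned long limit = online_cpus * 2;
        unsigned long n;
        char *end;

        if (strcmp(val, "single") == 0) {
            *mode = DECOMP_SINGLE;
            *max_threads = 1;
        } else if (strcmp(val, "multi") == 0) {
            *mode = DECOMP_MULTI;
            *max_threads = limit;          /* up to two per core */
        } else if (strcmp(val, "percpu") == 0) {
            *mode = DECOMP_PERCPU;
            *max_threads = online_cpus;    /* at most one per core */
        } else {
            n = strtoul(val, &end, 10);
            if (end == val || *end != '\0' || n < 1 || n > limit)
                return -EINVAL;            /* assumption: reject, not clamp */
            *mode = DECOMP_COUNT;
            *max_threads = n;
        }
        return 0;
    }

Under these assumptions, with four CPUs online threads=8 would be accepted and
threads=9 rejected.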

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |    table      |
        |---------------|
        |    export     |
        |    table      |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates.  Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) is compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed.  A block is stored uncompressed if the mksquashfs -noI
option is set, or if the compressed block was larger than the uncompressed
block.
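
The two-byte prefix can be decoded as sketched below.  This is a minimal
illustration rather than the kernel code: decode_meta_hdr and struct meta_hdr
are hypothetical names, the little-endian byte order is an assumption not
stated above, and any special-case encodings are ignored:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define UNCOMPRESSED_BIT 0x8000    /* top bit of the 16-bit length */

    struct meta_hdr {
        uint16_t length;     /* bytes stored on disk after the prefix */
        bool compressed;     /* false if the top bit was set */
    };

    /* Decode the two-byte prefix; assumes little-endian on-disk order. */
    static struct meta_hdr decode_meta_hdr(const uint8_t raw[2])
    {
        uint16_t v = (uint16_t)(raw[0] | (raw[1] << 8));
        struct meta_hdr h = {
            .length = (uint16_t)(v & ~UNCOMPRESSED_BIT),
            .compressed = !(v & UNCOMPRESSED_BIT),
        };

        return h;
    }

With this layout the low 15 bits are enough to describe any stored size below
32 KiB, which comfortably covers an 8 KiB metadata block.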

Inodes are packed into the metadata blocks and are not aligned to block
boundaries, so an inode may span two compressed blocks.  Inodes are identified
by a 48-bit number which encodes the location of the compressed metadata block
containing the inode, and the byte offset into that block where the inode is
placed (<block, offset>).
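
A <block, offset> pair can be packed into and unpacked from a single 48-bit
value as sketched below.  The 32-bit/16-bit split and the helper names are
assumptions for illustration; the text above only fixes the total width:

.. code-block:: c

    #include <stdint.h>

    /* Pack <block, offset> into the low 48 bits of a 64-bit value. */
    static inline uint64_t inode_ref(uint32_t block, uint16_t offset)
    {
        return ((uint64_t)block << 16) | offset;
    }

    /* Location of the compressed metadata block holding the inode. */
    static inline uint32_t inode_ref_block(uint64_t ref)
    {
        return (uint32_t)(ref >> 16);
    }

    /* Byte offset of the inode within the decompressed block. */
    static inline uint16_t inode_ref_offset(uint64_t ref)
    {
        return (uint16_t)(ref & 0xffff);
    }

A 16-bit offset is sufficient here because a decompressed metadata block is
8 KiB.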

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), with the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised as a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block.  A new directory header
is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.
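
The two-level list can be illustrated by a walker over decompressed directory
data.  This is a sketch, not the kernel reader.  The field layout (a 12-byte
header holding count, start block and inode number; 8-byte entries holding
offset, inode number delta, type and name size, followed by the name), the
little-endian byte order, and the convention that count and name size are
stored minus one are all assumptions that should be checked against the
kernel's squashfs_fs.h.  walk_dir and dir_cb are hypothetical names:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    static uint16_t le16(const uint8_t *p)
    {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    static uint32_t le32(const uint8_t *p)
    {
        return p[0] | (p[1] << 8) | (p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    typedef void (*dir_cb)(uint64_t inode_ref, const uint8_t *name, size_t len);

    /*
     * Walk len bytes of decompressed directory data.  Returns the number
     * of entries seen, or -1 if the data is truncated.
     */
    static int walk_dir(const uint8_t *buf, size_t len, dir_cb cb)
    {
        size_t pos = 0;
        int total = 0;

        while (pos < len) {
            if (len - pos < 12)
                return -1;
            uint32_t count = le32(buf + pos) + 1;       /* stored minus one */
            uint32_t start_block = le32(buf + pos + 4); /* shared by entries */
            pos += 12;

            for (uint32_t i = 0; i < count; i++) {
                if (len - pos < 8)
                    return -1;
                uint16_t offset = le16(buf + pos);
                size_t name_len = (size_t)le16(buf + pos + 6) + 1;
                pos += 8;
                if (len - pos < name_len)
                    return -1;
                if (cb)
                    cb(((uint64_t)start_block << 16) | offset,
                       buf + pos, name_len);
                pos += name_len;
                total++;
            }
        }
        return total;
    }

Each entry only needs to carry its own 16-bit offset, because the start block
is taken from the enclosing header.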

Directories are sorted in alphabetical order, and can contain a directory
index to speed up file lookup.  Directory indexes store one entry per
metablock, each entry storing the index/filename mapping to the first
directory header in each metadata block.  At lookup the index is scanned
linearly, looking for the first filename alphabetically larger than the
filename being looked up.  At this point the location of the metadata block
the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup, irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.
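
The linear scan can be sketched as follows.  struct dir_index is a
hypothetical in-memory form of an index entry, not the on-disk layout, and
strcmp() byte ordering is assumed to match the sort order used when the
directory was written:

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical in-memory form of one directory index entry. */
    struct dir_index {
        const char *first_name;     /* first filename under this header */
        unsigned int start_block;   /* metadata block holding it */
    };

    /*
     * Stop at the first index entry whose name sorts after name.  Returns
     * the position of the entry to start reading from, or -1 if the search
     * should begin at the start of the directory.
     */
    static int dir_index_lookup(const struct dir_index *idx, size_t count,
                                const char *name)
    {
        size_t i;

        for (i = 0; i < count; i++)
            if (strcmp(idx[i].first_name, name) > 0)
                break;
        return (int)i - 1;
    }

The entry before the first larger name identifies the metadata block to
decompress; the directory entries in that block are then searched for an exact
match.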

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
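
The slot figures above follow from simple arithmetic: 8 slots of 224 GiB give
1792 GiB, which is 1.75 TiB.  The sketch below models that relationship only.
The names are hypothetical, a 128 KiB block size is assumed, and the per-slot
coverage is taken to be exactly 224 GiB as stated:

.. code-block:: c

    #include <stdint.h>

    #define GIB             (UINT64_C(1) << 30)
    #define SLOT_COVERAGE   (224 * GIB)    /* per slot, 128 KiB blocks */
    #define NUM_SLOTS       8

    /* Slots needed to index size bytes, or -1 if beyond the cache's reach. */
    static int index_slots_needed(uint64_t size)
    {
        uint64_t slots = size / SLOT_COVERAGE + (size % SLOT_COVERAGE != 0);

        return slots > NUM_SLOTS ? -1 : (int)slots;
    }

For instance, a file one byte larger than 224 GiB would occupy two slots in
this model.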

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.
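
The two-level lookup reduces to a division and a remainder, as sketched
below.  The 16-byte entry size is an assumption not stated above, and
locate_fragment and struct frag_loc are hypothetical names:

.. code-block:: c

    #include <stdint.h>

    #define METADATA_SIZE    8192   /* uncompressed metadata block size */
    #define FRAG_ENTRY_SIZE  16     /* assumed on-disk entry size */

    struct frag_loc {
        uint32_t index_slot;    /* which entry of the second index table */
        uint32_t byte_offset;   /* offset within that metadata block */
    };

    /* Map a fragment index to its position in the lookup table. */
    static struct frag_loc locate_fragment(uint32_t fragment)
    {
        uint64_t bytes = (uint64_t)fragment * FRAG_ENTRY_SIZE;
        struct frag_loc loc = {
            .index_slot = (uint32_t)(bytes / METADATA_SIZE),
            .byte_offset = (uint32_t)(bytes % METADATA_SIZE),
        };

        return loc;
    }

The second index table supplies the on-disk location of metadata block
index_slot; after decompression the entry is found at byte_offset.  The same
pattern applies to any table stored in metadata blocks behind a second index
table, with a different entry size.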

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table.  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.  A second index table is
used to locate these.  For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance.  It
also allows values to be de-duplicated: the value is stored once, and
all other occurrences hold an out of line reference to that value.
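
Interpreting the type field can be sketched as below.  The numeric values
(prefix in the low byte, out-of-line flag at 0x100, and the prefix numbering)
are assumptions that should be checked against the kernel's squashfs_fs.h;
the helper names are hypothetical:

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define XATTR_PREFIX_MASK  0x00ff   /* assumed: prefix in the low byte */
    #define XATTR_VALUE_OOL    0x0100   /* assumed: value is out of line */

    enum xattr_prefix {
        XATTR_USER = 0,
        XATTR_TRUSTED = 1,
        XATTR_SECURITY = 2,
    };

    /* Return the name prefix encoded in type, or NULL if unknown. */
    static const char *xattr_prefix_name(uint16_t type)
    {
        switch (type & XATTR_PREFIX_MASK) {
        case XATTR_USER:
            return "user.";
        case XATTR_TRUSTED:
            return "trusted.";
        case XATTR_SECURITY:
            return "security.";
        default:
            return NULL;
        }
    }

    /* True if the value field holds a reference rather than the value. */
    static bool xattr_value_is_ool(uint16_t type)
    {
        return (type & XATTR_VALUE_OOL) != 0;
    }

Storing the prefix as a small number rather than as text means the stored
name only needs to hold the part after the prefix.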

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata
and one for fragments.

These caches are not used for file datablocks, which are decompressed and
cached in the page cache in the normal way.  They temporarily hold
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access.  Because metadata and fragments
are packed together into blocks (to gain greater compression), reading a
particular piece of metadata or fragment also retrieves the other
metadata/fragments packed with it.  Because of locality of reference, these
may be read in the near future.  Temporarily caching them ensures they are
available for near-future access without requiring an additional read and
decompress.

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page-sized
units this may introduce additional complexity in terms of locking and
associated race conditions.
