1## Files 2 3The implementation of leveldb is similar in spirit to the representation of a 4single [Bigtable tablet (section 5.3)](https://research.google/pubs/pub27898/). 5However the organization of the files that make up the representation is 6somewhat different and is explained below. 7 8Each database is represented by a set of files stored in a directory. There are 9several different types of files as documented below: 10 11### Log files 12 13A log file (*.log) stores a sequence of recent updates. Each update is appended 14to the current log file. When the log file reaches a pre-determined size 15(approximately 4MB by default), it is converted to a sorted table (see below) 16and a new log file is created for future updates. 17 18A copy of the current log file is kept in an in-memory structure (the 19`memtable`). This copy is consulted on every read so that read operations 20reflect all logged updates. 21 22## Sorted tables 23 24A sorted table (*.ldb) stores a sequence of entries sorted by key. Each entry is 25either a value for the key, or a deletion marker for the key. (Deletion markers 26are kept around to hide obsolete values present in older sorted tables). 27 28The set of sorted tables are organized into a sequence of levels. The sorted 29table generated from a log file is placed in a special **young** level (also 30called level-0). When the number of young files exceeds a certain threshold 31(currently four), all of the young files are merged together with all of the 32overlapping level-1 files to produce a sequence of new level-1 files (we create 33a new level-1 file for every 2MB of data.) 34 35Files in the young level may contain overlapping keys. However files in other 36levels have distinct non-overlapping key ranges. Consider level number L where 37L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB 38for level-1, 100MB for level-2, ...), one file in level-L, and all of the 39overlapping files in level-(L+1) are merged to form a set of new files for 40level-(L+1). These merges have the effect of gradually migrating new updates 41from the young level to the largest level using only bulk reads and writes 42(i.e., minimizing expensive seeks). 43 44### Manifest 45 46A MANIFEST file lists the set of sorted tables that make up each level, the 47corresponding key ranges, and other important metadata. A new MANIFEST file 48(with a new number embedded in the file name) is created whenever the database 49is reopened. The MANIFEST file is formatted as a log, and changes made to the 50serving state (as files are added or removed) are appended to this log. 51 52### Current 53 54CURRENT is a simple text file that contains the name of the latest MANIFEST 55file. 56 57### Info logs 58 59Informational messages are printed to files named LOG and LOG.old. 60 61### Others 62 63Other files used for miscellaneous purposes may also be present (LOCK, *.dbtmp). 64 65## Level 0 66 67When the log file grows above a certain size (4MB by default): 68Create a brand new memtable and log file and direct future updates here. 69 70In the background: 71 721. Write the contents of the previous memtable to an sstable. 732. Discard the memtable. 743. Delete the old log file and the old memtable. 754. Add the new sstable to the young (level-0) level. 76 77## Compactions 78 79When the size of level L exceeds its limit, we compact it in a background 80thread. The compaction picks a file from level L and all overlapping files from 81the next level L+1. Note that if a level-L file overlaps only part of a 82level-(L+1) file, the entire file at level-(L+1) is used as an input to the 83compaction and will be discarded after the compaction. Aside: because level-0 84is special (files in it may overlap each other), we treat compactions from 85level-0 to level-1 specially: a level-0 compaction may pick more than one 86level-0 file in case some of these files overlap each other. 87 88A compaction merges the contents of the picked files to produce a sequence of 89level-(L+1) files. We switch to producing a new level-(L+1) file after the 90current output file has reached the target file size (2MB). We also switch to a 91new output file when the key range of the current output file has grown enough 92to overlap more than ten level-(L+2) files. This last rule ensures that a later 93compaction of a level-(L+1) file will not pick up too much data from 94level-(L+2). 95 96The old files are discarded and the new files are added to the serving state. 97 98Compactions for a particular level rotate through the key space. In more detail, 99for each level L, we remember the ending key of the last compaction at level L. 100The next compaction for level L will pick the first file that starts after this 101key (wrapping around to the beginning of the key space if there is no such 102file). 103 104Compactions drop overwritten values. They also drop deletion markers if there 105are no higher numbered levels that contain a file whose range overlaps the 106current key. 107 108### Timing 109 110Level-0 compactions will read up to four 1MB files from level-0, and at worst 111all the level-1 files (10MB). I.e., we will read 14MB and write 14MB. 112 113Other than the special level-0 compactions, we will pick one 2MB file from level 114L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because 115level-(L+1) is ten times the size of level-L, and another two at the boundaries 116since the file ranges at level-L will usually not be aligned with the file 117ranges at level-L+1). The compaction will therefore read 26MB and write 26MB. 118Assuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst 119compaction cost will be approximately 0.5 second. 120 121If we throttle the background writing to something small, say 10% of the full 122100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at 12310MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This 124may significantly increase the cost of reads due to the overhead of merging more 125files together on every read. 126 127Solution 1: To reduce this problem, we might want to increase the log switching 128threshold when the number of level-0 files is large. Though the downside is that 129the larger this threshold, the more memory we will need to hold the 130corresponding memtable. 131 132Solution 2: We might want to decrease write rate artificially when the number of 133level-0 files goes up. 134 135Solution 3: We work on reducing the cost of very wide merges. Perhaps most of 136the level-0 files will have their blocks sitting uncompressed in the cache and 137we will only need to worry about the O(N) complexity in the merging iterator. 138 139### Number of files 140 141Instead of always making 2MB files, we could make larger files for larger levels 142to reduce the total file count, though at the expense of more bursty 143compactions. Alternatively, we could shard the set of files into multiple 144directories. 145 146An experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings 147to do 100K file opens in directories with varying number of files: 148 149 150| Files in directory | Microseconds to open a file | 151|-------------------:|----------------------------:| 152| 1000 | 9 | 153| 10000 | 10 | 154| 100000 | 16 | 155 156So maybe even the sharding is not necessary on modern filesystems? 157 158## Recovery 159 160* Read CURRENT to find name of the latest committed MANIFEST 161* Read the named MANIFEST file 162* Clean up stale files 163* We could open all sstables here, but it is probably better to be lazy... 164* Convert log chunk to a new level-0 sstable 165* Start directing new writes to a new log file with recovered sequence# 166 167## Garbage collection of files 168 169`RemoveObsoleteFiles()` is called at the end of every compaction and at the end 170of recovery. It finds the names of all files in the database. It deletes all log 171files that are not the current log file. It deletes all table files that are not 172referenced from some level and are not the output of an active compaction. 173