9 can be backed by up to 256TB of storage, and can present a logical size of
10 up to 4PB. This target was originally developed at Permabit Technology
18 deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
26 problem. The first is to recognize duplicate data. The second is to avoid
28 parts: a deduplication index (called UDS) that is used to discover
30 maps from logical block addresses to the actual storage location of the
36 Due to the complexity of data optimization, the number of metadata
37 structures involved in a single write operation to a vdo target is larger
39 block sizes in order to achieve good deduplication rates, acceptable
41 design attempts to be lock-free.
43 Most of a vdo's main data structures are designed to be easily divided into
46 normal operation, each zone is assigned to a specific thread, and only that
49 request object (the "data_vio") which will be added to a work queue when
50 the next phase of its operation requires access to the structures in the
66 In order to identify duplicate data efficiently, vdo was designed to
69 sets with significant amounts of duplicate data, the duplicates tend to
73 temporal order. The second insight is that new data is more likely to
74 duplicate recent data than it is to duplicate older data and in general,
75 there are diminishing returns to looking further back in time. Therefore,
76 when the index is full, it should cull its oldest records to make space for
78 ultimate goal of deduplication is to reduce storage costs. Since there is a
79 trade-off between the storage saved and the resources expended to achieve
80 those savings, vdo does not attempt to find every last duplicate block. It
81 is sufficient to find and eliminate most of the redundancy.
83 Each block of data is hashed to produce a 16-byte block name. An index
85 that data on the underlying storage. However, it is not possible to
87 because it is too costly to update the index when a block is overwritten
89 with the blocks, which is difficult to do efficiently in block-based
95 index as hints, and reads each indicated block to verify that it is indeed
98 Records are collected into groups called chapters. New records are added to
103 created to collect new records.
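The pieces described so far can be made concrete with a toy userspace sketch: a 16-byte block name derived from the data, an index record pairing that name with a location hint, and an open chapter that accumulates new records. The hash used here is a stand-in (two 64-bit FNV-1a passes), not the hash the index actually uses, and every name in the sketch is illustrative rather than vdo's API::

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  #define VDO_BLOCK_SIZE 4096
  #define RECORDS_PER_CHAPTER 8    /* real chapters hold far more records */

  struct block_name {              /* 16-byte deduplication key */
      uint8_t bytes[16];
  };

  struct index_record {            /* block name plus a storage-location hint */
      struct block_name name;
      uint64_t physical_block;
  };

  struct open_chapter {            /* new records accumulate here until full */
      struct index_record records[RECORDS_PER_CHAPTER];
      unsigned int count;
  };

  static uint64_t fnv1a64(const uint8_t *data, size_t len, uint64_t hash)
  {
      for (size_t i = 0; i < len; i++) {
          hash ^= data[i];
          hash *= 0x100000001b3ULL;
      }
      return hash;
  }

  /* Identical data always produces the identical 16-byte name. */
  static struct block_name name_data_block(const uint8_t block[VDO_BLOCK_SIZE])
  {
      struct block_name name;
      uint64_t lo = fnv1a64(block, VDO_BLOCK_SIZE, 0xcbf29ce484222325ULL);
      uint64_t hi = fnv1a64(block, VDO_BLOCK_SIZE, 0x9e3779b97f4a7c15ULL);

      memcpy(name.bytes, &lo, sizeof(lo));
      memcpy(name.bytes + sizeof(lo), &hi, sizeof(hi));
      return name;
  }

  /* Record newly written data in the open chapter. */
  static void chapter_add(struct open_chapter *chapter,
                          const uint8_t block[VDO_BLOCK_SIZE], uint64_t physical)
  {
      if (chapter->count == RECORDS_PER_CHAPTER)
          chapter->count = 0;   /* real code closes the chapter and opens a new one */

      chapter->records[chapter->count].name = name_data_block(block);
      chapter->records[chapter->count].physical_block = physical;
      chapter->count++;
  }

  int main(void)
  {
      static uint8_t block[VDO_BLOCK_SIZE];
      struct open_chapter chapter = { .count = 0 };

      chapter_add(&chapter, block, 12345);   /* 12345 is a made-up location */
      printf("records in open chapter: %u\n", chapter.count);
      return 0;
  }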
105 Closing a chapter converts it to a different format which is optimized for
106 reading. The records are written to a series of record pages based on the
108 locality should be on a small number of pages, reducing the I/O required to
112 without having to load the entire chapter from storage. This index uses
114 index entry refers to the desired block name. It can only guarantee that if
119 Once enough records have been written to fill up all the available index
120 space, the oldest chapter is removed to make space for new chapters. Any
125 In order to find records in older chapters, the index also maintains a
127 mapping each block name to the chapter containing its newest record. This
135 in memory and is saved to storage only when the vdo target is shut down.
139 the name is new, or which chapter to search. If it returns a chapter, the
141 that the name is new, or which record page to search. Finally, if it is not
143 This process may require up to two page reads per request (one for the chapter index page and one for the record page).
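A sketch of that staged lookup, with trivial stand-ins for the three structures; only the shape of the control flow is meant to match the description above, and none of the names are vdo's::

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Hypothetical, simplified types; not vdo's actual API. */
  struct block_name { uint8_t bytes[16]; };
  struct index_record { struct block_name name; uint64_t physical_block; };

  /* Stand-in: the volume index lives in memory and names a chapter. */
  static bool volume_index_lookup(const struct block_name *name, uint32_t *chapter)
  {
      *chapter = name->bytes[0] % 8;   /* pretend the name maps to some chapter */
      return true;
  }

  /* Stand-in for the first possible page read: the chapter index page. */
  static bool chapter_index_lookup(uint32_t chapter, const struct block_name *name,
                                   uint32_t *record_page)
  {
      (void)chapter;
      *record_page = name->bytes[1] % 16;
      return true;
  }

  /* Stand-in for the second possible page read: the record page itself. */
  static bool record_page_search(uint32_t chapter, uint32_t record_page,
                                 const struct block_name *name,
                                 struct index_record *record)
  {
      (void)chapter;
      (void)record_page;
      memcpy(&record->name, name, sizeof(*name));
      record->physical_block = 42;     /* made-up hint for the example */
      return true;
  }

  /* At most two page reads per request: one chapter index page, one record page. */
  static bool index_lookup(const struct block_name *name, struct index_record *record)
  {
      uint32_t chapter, record_page;

      if (!volume_index_lookup(name, &chapter))
          return false;   /* the name is definitely new */
      if (!chapter_index_lookup(chapter, name, &record_page))
          return false;   /* chapter index says the name is not in this chapter */
      return record_page_search(chapter, record_page, name, record);
  }

  int main(void)
  {
      struct block_name name = { .bytes = { 7, 3 } };
      struct index_record record;

      if (index_lookup(&name, &record))
          printf("hint: physical block %llu\n",
                 (unsigned long long)record.physical_block);
      return 0;
  }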
152 Because we expect the hashes to be randomly distributed, the size of the
154 the deltas are expressed using a Huffman code to take up even less space.
156 allows the index to use many fewer bytes per entry than a traditional hash
157 table, but it is slightly more expensive to look up entries, because a
158 request must read every entry in a delta list to add up the deltas in order
159 to find the record it needs. The delta index reduces this lookup cost by splitting the key space into many short delta lists so that each list can be searched quickly.
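A toy illustration of the trade-off just described (none of this is UDS code): the keys within one delta list are stored as differences from their predecessors, so a lookup has to decode entries in order rather than jumping straight to one. The real index additionally Huffman-codes the deltas::

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define LIST_LEN 5

  int main(void)
  {
      /* Sorted keys that fall into one delta list... */
      uint32_t keys[LIST_LEN] = { 1005, 1017, 1018, 1042, 1096 };
      /* ...stored as a starting value plus small deltas. */
      uint32_t base = 1005;
      uint32_t deltas[LIST_LEN - 1] = { 12, 1, 24, 54 };

      uint32_t target = 1042;
      uint32_t current = base;
      bool found = (current == target);

      for (int i = 0; i < LIST_LEN - 1 && !found && current < target; i++) {
          current += deltas[i];   /* each entry must be decoded in turn */
          found = (current == target);
      }

      printf("key %u %s (keys[3] = %u)\n", target,
             found ? "found" : "not found", keys[3]);
      return 0;
  }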
163 The default index size can hold 64 million records, corresponding to about
167 than that, the index will not be able to find it because the records of the
169 200 GB file to a vdo target and then immediately writes it again, the two
176 deduplication beyond the 256GB threshold, vdo can be configured to use a
179 later. It is important to consider the expected workload for a vdo target
180 before configuring it.) There are two ways to do this.
182 One way is to increase the memory size of the index, which also increases
187 The other option is to enable sparse indexing. Sparse indexing increases
199 A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
200 fields and data to track vdo-specific information. A struct vio maintains a
201 pointer to a bio but also tracks other fields specific to the operation of
203 circumstances where vdo completes the bio but must continue to do work
204 related to deduplication or compression.
208 called a data_vio to track information about their progress. A struct
210 related to deduplication and other vdo features. The data_vio is the
212 set of steps to handle the application data, after which it is reset and
213 returned to a pool of data_vios for reuse.
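A greatly simplified and partly hypothetical picture of the relationship between the two structures; the real definitions in drivers/md/dm-vdo contain many more fields and differ in detail::

  #include <stdint.h>
  #include <stdio.h>

  struct bio;               /* the block-layer I/O the application submitted */

  struct vio {
      struct bio *bio;      /* underlying bio, if any, for this I/O */
      /* ... completion/callback state, I/O priority, etc. ... */
  };

  struct data_vio {
      struct vio vio;       /* a data_vio is a vio plus per-request state */
      uint64_t logical;     /* logical block address being read or written */
      uint8_t  name[16];    /* block name (hash) used for deduplication */
      uint64_t allocation;  /* physical block allocated for new data, if any */
      int      is_duplicate;/* whether matching data was found elsewhere */
      /* ... compression state, lock bookkeeping, journal sequence numbers ... */
  };

  int main(void)
  {
      /* After each trip through the read or write path, the data_vio is
       * reset and returned to a fixed pool for reuse. */
      printf("sizeof(struct data_vio) in this sketch: %zu bytes\n",
             sizeof(struct data_vio));
      return 0;
  }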
215 There is a fixed pool of 2048 data_vios. This number was chosen to bound
216 the amount of work that is required to recover from a crash. In addition,
224 work in concert to reduce or amortize metadata updates across as many data
229 Most of the vdo volume belongs to the slab depot. The depot contains a
230 collection of slabs. The slabs can be up to 32GB, and are divided into
232 These blocks are used either to store data, or to hold portions of the
233 block map (see below). In addition to the data blocks, each slab has a set
237 Reference updates are written to the slab journal. Slab journal blocks are
239 requests they do so in order to allow the main recovery journal (see below)
240 to free up space. The slab journal is used both to ensure that the main
241 recovery journal can regularly free up space, and also to amortize the cost
244 when there is a need to reclaim slab journal space. The write operations
245 are performed in the background as needed so they do not add latency to
248 Each slab is independent of every other. They are assigned to "physical
250 is assigned to zone n mod P.
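A minimal sketch of that assignment rule; the function and constant names are made up for illustration::

  #include <stdio.h>

  /* Round-robin assignment of slabs to physical zones: slab n -> zone n mod P. */
  static unsigned int slab_to_physical_zone(unsigned int slab_number,
                                            unsigned int physical_zone_count)
  {
      return slab_number % physical_zone_count;
  }

  int main(void)
  {
      const unsigned int zones = 3;   /* example value; configurable in practice */

      for (unsigned int slab = 0; slab < 7; slab++)
          printf("slab %u -> physical zone %u\n",
                 slab, slab_to_physical_zone(slab, zones));
      /* Prints zones 0,1,2,0,1,2,0: each zone owns an independent set of slabs. */
      return 0;
  }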
253 summary," which is used to reduce the amount of work needed to come back
256 reference count updates have been persisted to storage, and approximately
257 how full it is. During recovery, each physical zone will attempt to recover
266 The block map contains the logical-to-physical mapping. It can be thought
269 the given logical address. The other 4 bits are used to indicate the nature
272 one represents an uncompressed block, and the other 14 states are used to
280 There are 60 radix trees, which are assigned to "logical zones" in
281 round-robin fashion. (If there are L logical zones, tree n will belong to zone n
283 0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
284 on. The interleaving is maintained all the way up to the 60 root nodes.
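A small sketch of how a logical block address is routed to a tree and a logical zone, assuming (as the address ranges above imply) that each run of 812 consecutive logical addresses forms one leaf page's worth of entries; the names are illustrative, not vdo's::

  #include <stdio.h>

  #define ENTRIES_PER_LEAF_PAGE 812   /* one run of consecutive logical addresses */
  #define BLOCK_MAP_ROOT_COUNT 60     /* number of radix trees */

  /* Logical addresses are interleaved across the 60 trees in 812-entry runs. */
  static unsigned int tree_for_logical_block(unsigned long long lbn)
  {
      return (lbn / ENTRIES_PER_LEAF_PAGE) % BLOCK_MAP_ROOT_COUNT;
  }

  /* Trees are assigned to logical zones round-robin: tree n -> zone n mod L. */
  static unsigned int zone_for_tree(unsigned int tree, unsigned int logical_zones)
  {
      return tree % logical_zones;
  }

  int main(void)
  {
      const unsigned int logical_zones = 4;   /* example value */
      unsigned long long examples[] = { 0, 811, 812, 1623, 48720 };

      for (int i = 0; i < 5; i++) {
          unsigned int tree = tree_for_logical_block(examples[i]);
          printf("logical block %llu -> tree %u -> logical zone %u\n",
                 examples[i], tree, zone_for_tree(tree, logical_zones));
      }
      /* Address 48720 (812 * 60) wraps back around to tree 0. */
      return 0;
  }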
289 need to pre-allocate space for the entire set of logical mappings and also
292 In operation, the block map maintains two caches. It is prohibitive to keep
296 time, and is large enough to hold all the non-leaf pages of the entire
301 The recovery journal is used to amortize updates across the block map and
302 slab depot. Each write request causes an entry to be made in the journal.
311 before each journal block write to ensure that the physical data for the
313 writes are all issued with the FUA bit set to ensure the recovery journal
316 be updated to reflect the operation. These entries allow the vdo device to
317 reconstruct the logical-to-physical mappings after an unexpected
322 All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
323 as vdo has done enough work to guarantee that it can complete the write
326 requires data to be stable on storage, it must issue a flush or write the
335 to the application. The data_vio pool is protected by a spin lock.
341 will no longer have access to the application bio. The application bio
347 of the bio. It is vital to prevent simultaneous modifications of the
350 the logical address and the value is a pointer to the data_vio
355 operation finishes. It also sends a message to inform the current
358 compression packer (step 8d) rather than allowing it to continue
359 waiting to be packed.
361 This stage requires the data_vio to get an implicit lock on the
362 appropriate logical zone to prevent concurrent modifications of the
366 3. The data_vio traverses the block map tree to ensure that all the
367 necessary internal tree nodes have been allocated, by trying to find
370 pool used to store application data.
374 data_vio to lock the page-node that needs to be allocated. This
376 that causes other data_vios to wait for the allocation process to
380 happening, in order to allow other operations in the same logical
381 zone to proceed. The details of allocation are the same as in
382 step 4. Once a new node has been allocated, that node is added to
383 the tree using a similar process to adding a new data block mapping.
384 The data_vio journals the intent to add the new node to the block
386 (step 11), and reacquires the implicit logical zone lock to add the
387 new mapping to the parent tree node (step 12). Once the tree is
395 to traverse the tree again. The data_vio then releases the implicit
398 4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
399 made to allocate a free data block. This allocation ensures that the
402 physical zone to search for free space within that zone.
406 it will proceed to search the next physical zone by taking the implicit
408 free block or runs out of zones to search. The data_vio will acquire a
410 struct pbn_lock also has several fields to record the various kinds of
412 added to a hashtable like the logical block locks in step 2 (a simplified sketch of this claim-or-wait pattern follows the write path steps). This
414 reference count of the free block is updated to prevent any other
420 needs to complete the write. The application bio can safely be
422 thread to prevent the application callback from blocking other data_vio
425 If an allocation could not be obtained, the data_vio continues to
426 attempt to deduplicate or compress the data, but the bio is not
429 6. At this point vdo must determine where to store the application data.
435 tracked in a hashtable similar to the way logical block locks are
440 data_vio obtains a hash lock from the pool, adds it to the hashtable,
443 do all the work to decide where the application data will be
446 data_vio will wait for the agent to complete its work and then share
450 the data does not match the hash lock's agent, the data_vio skips to
451 step 8h and attempts to write its data directly. This can happen if two
454 8. The hash lock agent attempts to deduplicate or compress its data with
458 (struct uds_request) to the deduplication index. This does not
459 require the data_vio to get any locks because the index components
463 b. If the deduplication index returns advice, the data_vio attempts to
465 order to read the data and verify that it is the same as the
468 that address may soon be overwritten so it is not safe to use the
473 physical block as their new physical address and proceed to step 9
474 to record their new mapping. If there are more data_vios in the hash
476 data_vios becomes the new agent and continues to step 8d as if no
481 to. If the agent does not have an allocation, some other data_vio in
484 are out of space, so they proceed to step 13 for cleanup.
486 e. The agent attempts to compress its data. If the data does not
487 compress, the data_vio will continue to step 8h to write its data
491 implicit hash zone lock and go to the packer (struct packer) where
496 The packer can combine up to 14 compressed blocks in a single 4K
500 to fill out the compressed block. There is a mechanism for vdo to
501 evict waiting data_vios when continuing to wait would cause
503 flush, device shutdown, or a subsequent data_vio trying to overwrite
506 before more compressible blocks need to use its bin. An evicted
507 data_vio will proceed to step 8h to write its data directly.
517 modified to be a shared lock. Then it releases the implicit physical
518 zone lock and proceeds to step 8i.
521 step 3. It will write its data to that allocated physical block.
526 possible. Each data_vio will then proceed to step 9 to record its
530 the deduplication index is updated to reflect the location of the
535 because there are usually too many block map leaf nodes to store
548 recovery blocks up to the one containing its entry have been written
549 and flushed to ensure the transaction is stable on storage.
560 order to avoid underflow. After each slab journal entry is made in
565 logical-to-physical mapping in the block map to point to the new
569 lock and releases its hash lock to the pool.
574 that block back to zero to free it for use by subsequent data_vios.
580 acknowledged, and the data_vio is returned to the pool.
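The logical block locks of step 2, the physical block locks of step 4, and the hash locks of steps 6 and 7 all follow the same claim-or-wait pattern: the first data_vio to claim a key in the table becomes the holder, and later arrivals wait behind it. The toy userspace sketch below models only that pattern; the names are invented, and the real tables are richer per-zone structures::

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define LOCK_TABLE_SIZE 64

  struct data_vio { int id; };      /* stands in for the real request object */

  struct lock_entry {
      bool in_use;
      uint64_t key;                 /* e.g. a logical block address */
      struct data_vio *holder;      /* data_vio currently operating on key */
      /* real code also keeps a queue of waiting data_vios here */
  };

  static struct lock_entry lock_table[LOCK_TABLE_SIZE];

  /*
   * Returns true if vio became (or already was) the lock holder for key,
   * false if another data_vio holds it and the caller would have to wait.
   */
  static bool claim_lock(uint64_t key, struct data_vio *vio)
  {
      struct lock_entry *entry = &lock_table[key % LOCK_TABLE_SIZE];

      /* Toy collision handling: treat a slot clash as contention. */
      if (entry->in_use)
          return entry->key == key && entry->holder == vio;

      entry->in_use = true;
      entry->key = key;
      entry->holder = vio;
      return true;
  }

  static void release_lock(uint64_t key)
  {
      struct lock_entry *entry = &lock_table[key % LOCK_TABLE_SIZE];

      /* Real code would wake the next waiter instead of just clearing. */
      entry->in_use = false;
      entry->holder = NULL;
  }

  int main(void)
  {
      struct data_vio first = { 1 }, second = { 2 };

      printf("first claims lbn 100: %d\n", claim_lock(100, &first));   /* 1 */
      printf("second claims lbn 100: %d\n", claim_lock(100, &second)); /* 0: must wait */
      release_lock(100);
      printf("second retries lbn 100: %d\n", claim_lock(100, &second));/* 1 */
      return 0;
  }

Because each table is confined to a single zone's thread, claiming and releasing are plain memory operations; coordination across zones happens by moving the data_vio between zone work queues rather than by locking shared structures.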
585 1 and 2 in the write path to obtain a data_vio and lock its logical
587 address that is guaranteed to complete, the read data_vio will copy the
589 logical-to-physical mapping by traversing the block map tree as in step 3,
595 acknowledgment as in step 13, although it only needs to release the logical
596 lock and return itself to the pool.
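A rough outline of that read flow, assuming (as with other thinly provisioned targets) that an unmapped logical address reads back as zeros and that a compressed mapping identifies both a physical block and a slot within it; the helpers are stand-ins, not vdo functions::

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  enum mapping_state { UNMAPPED, UNCOMPRESSED, COMPRESSED };

  struct mapping {
      enum mapping_state state;
      uint64_t pbn;         /* physical block number */
      unsigned int slot;    /* compression slot within the physical block */
  };

  /* Stand-in for traversing the block map tree (write-path step 3). */
  static struct mapping look_up_mapping(uint64_t lbn)
  {
      struct mapping m = { .state = UNMAPPED };

      (void)lbn;
      return m;             /* pretend nothing has been written here yet */
  }

  static void read_block(uint64_t lbn, uint8_t *buffer, size_t size)
  {
      struct mapping m = look_up_mapping(lbn);

      switch (m.state) {
      case UNMAPPED:
          /* Never written (or discarded): the read returns zeros. */
          memset(buffer, 0, size);
          break;
      case UNCOMPRESSED:
          /* Read the 4K physical block m.pbn and copy it out. */
          break;
      case COMPRESSED:
          /* Read physical block m.pbn, then extract and decompress the
           * fragment in slot m.slot. */
          break;
      }
      /* Finally, release the logical block lock and return the data_vio
       * to the pool, as in write-path step 13. */
  }

  int main(void)
  {
      uint8_t buf[4096];

      read_block(1234, buf, sizeof(buf));
      printf("first byte: %u\n", buf[0]);   /* 0: unmapped reads as zeros */
      return 0;
  }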
605 this operation are nearly identical to the normal read and write
610 When a vdo is restarted after a crash, it will attempt to recover from the
614 into the slab journals. Finally, each physical zone attempts to replay at
615 least one slab journal to reconstruct the reference counts of one slab.
618 used to reconstruct the rest of the reference counts in the background.
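A compact sketch of that recovery ordering; the zone and slab counts are example values, the functions are stand-ins, and the real code chooses slabs using the slab summary rather than this simple loop::

  #include <stdio.h>

  #define SLAB_COUNT 12
  #define PHYSICAL_ZONES 3

  static void replay_recovery_journal_into_slab_journals(void)
  {
      /* Reads the recovery journal and adds any missing entries to the
       * slab journals of the affected slabs. */
  }

  static void replay_slab_journal(unsigned int slab)
  {
      /* Reconstructs the reference counts of one slab from its journal. */
      printf("replaying slab journal %u\n", slab);
  }

  int main(void)
  {
      replay_recovery_journal_into_slab_journals();

      /* Each physical zone replays at least one slab journal up front so
       * that every zone has usable reference counts before the device
       * comes back online. */
      for (unsigned int zone = 0; zone < PHYSICAL_ZONES; zone++)
          replay_slab_journal(zone);    /* slab n belongs to zone n mod P */

      /* The remaining slabs are reconstructed in the background. */
      for (unsigned int slab = PHYSICAL_ZONES; slab < SLAB_COUNT; slab++)
          replay_slab_journal(slab);

      return 0;
  }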
624 lost. The vdo may be instructed to rebuild as best it can in order to
625 return to a writable state. However, this is never done automatically due
626 to the possibility that data has been lost. During a read-only rebuild, the
632 consistent with each other. This allows vdo to resume normal operation and