
18 reservations bounds. At this point we need to explain how relogging works. With
26 XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
29 physical logging mechanisms to provide the necessary recovery guarantees the
33 details logged are made up of the changes to in-core structures rather than
40 The reason for these differences is to keep the amount of log space and CPU time
41 required to process objects being modified as small as possible and hence the
46 The method used to log an item or chain modifications together isn't
47 particularly important in the scope of this document. It suffices to know that
51 followed to guarantee forwards progress and prevent deadlocks.
63 The type and size of reservation must be matched to the modification taking
72 <join item to transaction>
78 resources joined to it are released, along with the remaining unused reservation
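To make the shape of a one-shot transaction concrete, here is a minimal sketch of the in-kernel call sequence (the reservation type and the logging flags are illustrative choices, and error handling is trimmed)::

    struct xfs_trans	*tp;
    int			error;

    /* allocate a transaction with a pre-computed reservation */
    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
    if (error)
            return error;

    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);	/* join item to transaction */

    /* <modify the in-core inode>, then log the changed regions */
    xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

    /* commit: write to the journal, release locks and unused reservation */
    return xfs_trans_commit(tp);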
97 While this might look similar to a one-shot transaction, there is an important
107 transactions is running, nothing else can read from or write to the inode and
108 this provides a mechanism for complex changes to appear atomic from an external
111 It is important to note that a series of rolling transactions in a permanent
114 way through, then recovery will only replay up to the last transactional
115 modification the loop made that was committed to the journal.
117 This affects long running permanent transactions in that it is not possible to
120 if a long running operation requires multiple transactions to fully complete,
121 the high level operation must use intents and deferred operations to guarantee
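A sketch of the rolling pattern itself (the work loop and its termination condition are stand-ins for a real multi-step operation)::

    struct xfs_trans	*tp;
    int			error;

    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
    if (error)
            return error;

    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, 0);	/* inode stays locked across rolls */

    while (have_more_work) {
            /* <modify and log a bounded unit of work> */

            /* commit this step and atomically start the next one */
            error = xfs_trans_roll_inode(&tp, ip);
            if (error)
                    break;
    }
    if (error)
            xfs_trans_cancel(tp);
    else
            error = xfs_trans_commit(tp);
    xfs_iunlock(ip, XFS_ILOCK_EXCL);
    return error;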
131 to stable storage when it returns. Hence when a system crashes, not all the
136 that were committed prior to that change will also be seen.
138 For single shot operations that need to reach stable storage immediately, or
141 a "log force" to flush the outstanding committed transactions to stable storage
142 in the journal and wait for that to complete.
145 throughput to the IO latency limitations of the underlying storage. Instead, we
146 tend to use log forces to ensure modifications are on stable storage only when
147 a user operation requires a synchronisation point to occur (e.g. fsync).
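In code, such a synchronisation point reduces to a single call to the in-kernel log force interface::

    /* flush all committed transactions to stable storage and wait */
    error = xfs_log_force(mp, XFS_LOG_SYNC);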
153 It has been mentioned a number of times now that the logging subsystem needs to
155 because it can't be written to the journal due to a lack of space in the
161 available to write the modification into the journal before we start making
162 modifications to objects and items. As such, the reservation needs to be large
163 enough to take into account the amount of metadata that the change might need to
165 transaction, we have to reserve enough space to record a full leaf-to-root split
166 of the btree. As such, the reservations are quite complex because we have to
173 again. Then we might have to update reverse mappings, which modifies yet
179 log has this much space available before the transaction is allowed to proceed
180 so that when we come to write the dirty metadata into the log we don't run out
184 required for the transaction to proceed. For permanent transactions, however, we
185 also have a "log count" that affects the size of the reservation that is to be
189 reservation, it is somewhat inefficient to do this as it requires the
190 transaction rolling mechanism to re-reserve space on every transaction roll. We
192 rolls are likely for the common modifications that need to be made.
194 For example, an inode allocation is typically two transactions - one to
195 physically allocate a free inode chunk on disk, and another to allocate an inode
197 transaction, we might set the reservation log count to a value of 2 to indicate
203 reservation is increased from a single unit reservation to multiple unit
205 means we can roll the transaction multiple times before we have to re-reserve
207 modifications we make only need to reserve log space once.
209 If the log count for a permanent transaction reaches zero, then it needs to
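The unit size and log count are carried by the reservation structure itself. A sketch with purely illustrative numbers (the field names match the in-kernel struct xfs_trans_res)::

    struct xfs_trans_res example_res = {
            .tr_logres	= 8192,	/* one unit: space for a single roll */
            .tr_logcount	= 2,	/* units reserved up front */
            .tr_logflags	= XFS_TRANS_PERM_LOG_RES,
    };

    /* total log space reserved at transaction start:
     * tr_logres * tr_logcount = 16384 bytes */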
217 The position in the log is typically referred to as a Log Sequence Number (LSN).
222 bytes). Hence we can do relatively simple LSN-based math to keep track of
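A minimal userspace model of that math (the kernel packs the cycle and block numbers into a 64-bit xfs_lsn_t in just this way)::

    #include <stdint.h>

    typedef uint64_t xfs_lsn_t;	/* cycle in the high 32 bits, block in the low */

    static inline xfs_lsn_t assign_lsn(uint32_t cycle, uint32_t block)
    {
            return ((xfs_lsn_t)cycle << 32) | block;
    }

    static inline uint32_t lsn_cycle(xfs_lsn_t lsn) { return lsn >> 32; }
    static inline uint32_t lsn_block(xfs_lsn_t lsn) { return (uint32_t)lsn; }

    /* ordering two positions in the log is a plain 64-bit comparison */
    static inline int lsn_cmp(xfs_lsn_t a, xfs_lsn_t b)
    {
            return (a > b) - (a < b);
    }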
235 into the log rather than basic blocks. Hence it technically isn't using LSNs to
239 The reserve grant head is used to accurately account for exact transaction
241 and need to write into the log. The reserve head is used to prevent new
257 exhausted. At this point, we still require a log space reservation to continue
259 sleep during the transaction commit process waiting for new log space to become
262 enough free space in the log to fulfill all of the pending reservations and
265 To take a new reservation without sleeping requires us to be able to take a
267 we need to be able to *overcommit* the log reservation space. As has already
273 to remove the overcommit and start taking new reservations. In other words, we
279 the write head - is then reserved separately by a call to xfs_log_reserve()
281 physical log space to be reserved from the write grant head, but only if one
288 attached to the transaction chain being rolled are always relocated to the
291 deadlock the log as we cannot take the locks needed to write back that item and
292 move the tail of the log forwards to free up write grant space. Re-logging the
299 (eventually) made available to permanent transactions no matter how many times
306 XFS allows multiple separate modifications to a single object to be carried in
307 the log at any given time. This allows the log to avoid needing to flush each
308 change to disk before recording a new change to the object. XFS does this via a
310 is that any new change to the object is recorded with a *new copy* of all the
311 existing changes in the new transaction that is written to the log.
313 That is, if we have a sequence of changes A through to F, and the object was
314 written to disk after change D, we would see in the log the following series
323 <object written to disk>
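Reconstructing the full series the surrounding text describes (the LSNs are symbolic; n, m, o and p are the sizes of the intervening journal writes)::

    Transaction	Contents	LSN
       A		   A		   X
       B		  A+B		  X+n
       C		 A+B+C		 X+n+m
       D		A+B+C+D		X+n+m+o
        <object written to disk>
       E		   E		   Y (> X+n+m+o)
       F		  E+F		  Y+p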
330 This relogging technique allows objects to be moved forward in the log so that
333 of each subsequent transaction, and it's the technique that allows us to
344 Hence it can be seen that the relogging operation is fundamental to the correct
346 people should be able to see why the XFS metadata operations write so much to
347 the log - repeated operations to the same objects write the same changes to
348 the log over and over again. Worse is the fact that objects tend to get
353 hand in hand. That is, transactions don't get written to the physical journal
356 transactions to disk. This means that XFS is doing aggregation of transactions
357 in memory - batching them, if you like - to minimise the impact of the log IO on
363 to 256kB by use of a mount option.
366 that can be made to the filesystem at any point in time - if all the log
368 the current batch completes. It is now common for a single current CPU core to
369 be able to issue enough transactions to keep the log buffers full and under
370 IO permanently. Hence the XFS journalling subsystem can be considered to be IO
376 The key thing to note about the asynchronous logging combined with the
378 multiple times before they are committed to disk in the log buffers. If we
379 return to the previous relogging example, it is entirely possible that
380 transactions A through D are committed to disk in the same log buffer.
383 but only one of those copies needs to be there - the last one "D", as it
388 buffers. It is clear that reducing the number of stale objects written to the
389 log would greatly reduce the amount of metadata we write to the log, and this
394 logical to physical formatting to do the relogging because there is no
395 infrastructure to keep track of logical changes in memory prior to physically
396 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
399 Delayed logging is the name we've given to keeping and tracking transactional
400 changes to objects in memory outside the log buffer infrastructure. Because of
401 the relogging concept fundamental to the XFS journalling subsystem, this is
402 actually relatively easy to do - all the changes to logged items are already
403 tracked in the current infrastructure. The big problem is how to accumulate
404 them and get them to the log in a consistent, recoverable manner.
408 One of the key changes that delayed logging makes to the operation of the
412 written to the log at any point in time, there may be a much greater amount
421 need to ensure application level data integrity is maintained.
424 warrants rigorous proofs to determine whether it is correct or not. The method
425 of accumulating changes in memory for some period before writing them to the
427 no time is spent in this document trying to convince the reader that the
433 1. Reduce the amount of metadata written to the log by at least
435 2. Supply sufficient statistics to validate Requirement #1.
436 3. Supply sufficient new tracing infrastructure to be able to debug
449 existing log item dirty region tracking) is that when it comes to writing the
450 changes to the log buffers, we need to ensure that the object we are formatting
451 is not changing while we do this. This requires locking the object to prevent
452 concurrent modification. Hence flushing the logical changes to the log would
453 require us to lock every object, format them, and then unlock them again.
457 the delayed logging tracking lock to commit the transaction. However, the
459 trying to get the lock on object A to flush it to the log buffer. This appears
460 to be an unsolvable deadlock condition, and it was solving this problem that
461 was the barrier to implementing delayed logging for so long.
463 The solution is relatively simple - it just took a long time to recognise it.
464 Put simply, the current logging code formats the changes to each item into a
465 vector array that points to the changed regions in the item. The log write code
466 simply copies the memory these vectors point to into the log buffer during
469 allocated memory buffer big enough to fit the formatted vector.
471 If we then copy the vector into the memory buffer and rewrite the vector to
472 point to the memory buffer rather than the object itself, we now have a copy of
474 that does not require us to lock the item to access. This formatting and
477 without needing to lock the owning item.
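A userspace model of this format-and-repoint step (the kernel builds struct xfs_log_vec/struct xfs_log_iovec chains in the CIL insert path; the names and error handling here are simplified)::

    #include <stdlib.h>
    #include <string.h>

    struct log_iovec {
            void	*i_addr;	/* points at a changed region */
            int	i_len;
    };

    /*
     * Copy the regions the item's vector describes into one private
     * buffer, then repoint the vector at that copy. The result can be
     * written to the log later without taking the item lock again.
     * (Buffer ownership passes to the log item; freeing is elided.)
     */
    static int format_item(struct log_iovec *vec, int nvecs)
    {
            int	size = 0, off = 0;
            char	*buf;
            int	v;

            for (v = 0; v < nvecs; v++)
                    size += vec[v].i_len;

            buf = malloc(size);
            if (!buf)
                    return -1;

            for (v = 0; v < nvecs; v++) {
                    memcpy(buf + off, vec[v].i_addr, vec[v].i_len);
                    vec[v].i_addr = buf + off;	/* now points at the copy */
                    off += vec[v].i_len;
            }
            return 0;
    }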
479 Hence we avoid the need to lock items when we need to flush outstanding
480 asynchronous transactions to the log. The differences between the existing
509 The memory buffer and associated vector need to be passed as a single object,
510 but still need to be associated with the parent object so if the object is
515 buffer is to support splitting vectors across log buffer boundaries correctly.
519 change and as such is not desirable. It also means we'd have to write the log
521 region state that needs to be placed into the headers during the log write.
523 Hence we need to keep the vector, but by attaching the memory buffer to it and
524 rewriting the vector addresses to point at the memory buffer we end up with a
525 self-describing object that can be passed to the log buffer write code to be
527 Hence we avoid needing a new on-disk format to handle items that have been
535 them to be used without limitations, we need to be able to track and accumulate
536 them so that they can be written to the log at some later point in time. The
537 log item is the natural place to store this vector and buffer; it also makes sense
538 to be the object used to track committed objects, as it will always
541 The log item is already used to track the log items that have been written to
542 the log but not yet written to disk. Such log items are considered "active"
545 completion, after which they are unpinned and can be written to disk. An object
546 that is in the AIL can be relogged, which causes the object to be pinned again
551 and relogged, so any tracking must be separate to the AIL infrastructure. As
557 Similar to the AIL, tracking of committed items is done through a new list
559 committed and have formatted memory buffers attached to them. It tracks objects
562 and done to make it easy for debugging - the last items in the list are the
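The shape of the tracking structure is roughly as follows (a sketch: the field names follow the in-kernel struct xfs_cil, and the comments are annotations)::

    struct xfs_cil {
            struct xlog		*xc_log;
            struct list_head	xc_cil;		/* committed, unwritten items */
            spinlock_t		xc_cil_lock;	/* protects insertion order */
            struct xfs_cil_ctx	*xc_ctx;	/* current checkpoint context */
            struct list_head	xc_committing;	/* contexts being flushed */
    };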
573 We need to write these items in the order that they exist in the CIL, and they
574 need to be written as an atomic transaction. The need for all the objects to be
581 To fulfill this requirement, we need to write the entire CIL in a single log
585 reason for this limit is that to find the head and tail of the log, there must
591 size of a checkpoint to be slightly less than half the log.
594 to any other transaction - it contains a transaction header, a series of
598 might need to tune the recovery transaction object hash size.
600 Because the checkpoint is just another transaction and all the changes to log
602 code to write the changes into the log. To do this efficiently, we need to
604 transaction. The current log write code enables us to do this easily with the
606 the transaction commit record, but tracking this requires us to have a
607 per-checkpoint context that travels through the log write process through to
611 checkpoint from initiation to checkpoint completion. A new context is initiated
615 context and attach that to the CIL for aggregation of new transactions.
617 This allows us to unlock the CIL immediately after transfer of all the
618 committed items and effectively allows new transactions to be issued while we
620 checkpoints to be written into the log buffers in the case of log force heavy
625 To ensure that we can be writing an item into a checkpoint transaction at
628 to store the list of log vectors that need to be written into the transaction.
629 Hence log vectors need to be able to be chained together to allow them to be
631 buffer and log vector attached to each log item needs to be attached to the
679 start, while the checkpoint flush code works over the log vector chain to
683 attached to the log buffer that the commit record was written to along with a
689 Discussion Point: I am uncertain as to whether the log item is the most
690 efficient way to track vectors, even though it seems like the natural way to do
691 it. The fact that we walk the log items (in the CIL) just to chain the log
694 the log vector chaining. If we track by the log vectors, then we only need to
707 This allows transactions to be issued asynchronously even though there may be
709 committed to the log. In the rare case that a dependent operation occurs (e.g.
711 force can be issued to force the dependent transaction to disk immediately.
713 To do this, transactions need to record the LSN of the commit record of the
721 contexts, and as such it is simple to assign a sequence number to each
723 atomically, it is simple to ensure that each new context has a monotonically
724 increasing sequence number assigned to it without the need for an external
726 one to it for the new context.
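A sketch of the context switch-over that yields these sequence numbers (field names follow the kernel's CIL implementation; locking is elided)::

    /* done while the CIL context lock is held exclusively */
    new_ctx->sequence = cil->xc_ctx->sequence + 1;	/* monotonic by construction */
    list_add(&cil->xc_ctx->committing, &cil->xc_committing);
    cil->xc_ctx = new_ctx;		/* new transactions aggregate here */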
728 Then, instead of assigning a log buffer LSN to the transaction commit LSN
731 checkpoint sequence needs to be committed before they can continue. As a
732 result, the code that forces the log to a specific LSN now needs to ensure that
733 the log forces to a specific checkpoint.
735 To ensure that we can do this, we need to track all the checkpoint contexts
736 that are currently committing to the log. When we flush a checkpoint, the
737 context gets added to a "committing" list which can be searched. When a
741 using the existing log force mechanisms to execute synchronous forces.
743 It should be noted that the synchronous forces may need to be extended with
744 mitigation algorithms similar to the current log buffer code to allow
749 The main concern with log forces is to ensure that all the previous checkpoints
750 are also committed to disk before the one we need to wait for. Therefore we
751 need to check that all the prior contexts in the committing list are also
752 complete before waiting on the one we need to complete. We do this
753 synchronisation in the log force code so that we don't need to wait anywhere
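Sketched out, the force-to-sequence synchronisation is a list walk (illustrative only; wait_for_commit_record() is a hypothetical stand-in for sleeping on the CIL's commit wait queue)::

    /* wait for every checkpoint up to and including `sequence` */
    list_for_each_entry(ctx, &cil->xc_committing, committing) {
            if (ctx->sequence > sequence)
                    continue;
            /* stand-in: sleep until ctx has a commit LSN assigned */
            wait_for_commit_record(ctx);
    }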
756 The only remaining complexity is that a log force now also has to handle the
758 is, we need to flush the CIL and potentially wait for it to complete. This is a
759 simple addition to the existing log forcing code to check the sequence numbers
762 transactions to remain untouched (i.e. commit an asynchronous transaction, then
770 transaction. We don't know how big a checkpoint transaction is going to be
771 ahead of time, nor how many log buffers it will take to write out, nor the
772 number of split log vector regions that are going to be used. We can track the
773 amount of log space required as we add items to the commit item list, but we
774 still need to reserve the space in the log for the checkpoint.
788 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
789 vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
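(That is, 20,000 vectors * 12 bytes of vector header is ~240kB of overhead on top of the 1.5MB of metadata, giving roughly 1.75MB in total.)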
794 not particularly flexible, and it is difficult to select the "optimal value" for
797 Further, if we are going to use a static reservation, which bit of the entire
801 relogged. This allows for a checkpoint reservation to only have to account for
806 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
807 large enough to handle arbitrary sized checkpoint transactions. This
808 reservation needs to be made before the checkpoint is started, and we need to
809 be able to reserve the space without sleeping. For an 8MB checkpoint, we need a
812 A static reservation needs to manipulate the log grant counters - we can take a
813 permanent reservation on the space, but we still need to make sure we refresh
814 the write reservation (the actual space available to the transaction) after
818 The problem with this is that it can lead to deadlocks as we may need to commit
819 checkpoints to be able to free up log space (refer back to the description of
821 space available in the log if we are to use static reservations, and that is
822 very difficult and complex to arrange. It is possible to do, but there is a
826 items in the CIL and using this to dynamically calculate the amount of log
833 will always be less than or equal to the maximal amount in the reservation.
836 are added to the CIL and avoid the need for reserving and regranting log space
840 As mentioned earlier, transactions can't grow to more than half the size of the
841 log. Hence as part of the reservation growing, we need to also check the size
843 the maximum threshold, we need to push the CIL to the log. This is effectively
844 a "background flush" and is done on demand. This is identical to
846 checkpoint commit to complete. This background push is checked and executed by
851 force will push the CIL to disk, and if the transaction subsystem stays idle,
852 allow the idle log to be covered (effectively marked clean) in exactly the same
854 whether this log force needs to be done more frequently than the current rate
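The check itself is cheap and runs at transaction commit time; a sketch (XLOG_CIL_SPACE_LIMIT() is the kernel's threshold macro, the push function named here is illustrative)::

    /* checked on every transaction commit into the CIL */
    if (cil->xc_ctx->space_used > XLOG_CIL_SPACE_LIMIT(log))
            xlog_cil_push_background(log);	/* flush the CIL, don't wait */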
864 that items get pinned once for every transaction that is committed to the log
872 For delayed logging, however, we have an asymmetric transaction commit to
875 That is, we now have a many-to-one relationship between transaction commit and
880 To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
882 pinning and unpinning become symmetric around a checkpoint context. We have to
887 value according to its context.
889 Just to make matters slightly more complex, this checkpoint level context
897 lock to guarantee that we pin the items correctly.
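A sketch of the resulting pin rule on insertion into the CIL (simplified; iop_pin is the log item operation that takes the pin, and the exact test the kernel uses may differ)::

    spin_lock(&cil->xc_cil_lock);
    if (list_empty(&item->li_cil))
            item->li_ops->iop_pin(item);	/* first commit: pin once */
    list_move_tail(&item->li_cil, &cil->xc_cil);	/* relog to the tail */
    spin_unlock(&cil->xc_cil_lock);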
903 commits must scale to many concurrent commits. The current transaction commit
908 As a result, the delayed logging transaction commit code needs to be designed
913 2. Adding items to the CIL and updating item space accounting
917 that we have a many-to-one interaction here. That is, the only restriction on
918 the number of concurrent transactions that can be trying to commit at once is
923 The amount of time a transaction commit needs to hold out a flush is a
924 relatively long period of time - the pinning of log items needs to be done
928 separately to the pinning of objects could be used to reduce the hold time of
932 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
941 compared to transaction commit for asynchronous transaction workloads - only
943 transaction commit concurrency due to cache line bouncing of the lock on the
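The exclusion that falls out is reader/writer shaped (sketch; the kernel uses an rwsem, xc_ctx_lock, in exactly this pattern)::

    /* transaction commit: many can run at once */
    down_read(&cil->xc_ctx_lock);
    /* ... format the items, insert them into the CIL ... */
    up_read(&cil->xc_ctx_lock);

    /* checkpoint flush: excludes all commits while the context swaps */
    down_write(&cil->xc_ctx_lock);
    /* ... detach the CIL, install a new checkpoint context ... */
    up_write(&cil->xc_ctx_lock);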
948 concurrently, the CIL needs to be protected separately from the above
949 commit/flush exclusion. It also needs to be an exclusive lock but it is only
959 checkpoints and needs to block waiting for checkpoints to complete their commit
961 sequencing also requires the same lock, list walk, and blocking mechanism to
966 sequencing needs to wait until checkpoint contexts contain a commit LSN
968 sequencing needs to wait until previous checkpoint contexts are removed from
970 broadcast wakeups (thundering herds) has been used to implement these two
974 given separate wait lists to reduce lock contention and the number of processes
986 4. Join item to transaction
989 Attach log item to owner item
990 Attach log item to transaction
998 Attach transaction to log buffer
1011 Flush item to disk
1031 4. Join item to transaction
1034 Attach log item to owner item
1035 Attach log item to transaction
1041 Attach log vector and buffer to log item
1055 attach checkpoint context to log buffer
1068 Flush item to disk
1077 committing of the log items to the log itself and the completion processing.
1082 and the design of the internal structures to avoid on disk format changes, we
1085 be able to swap methods automatically and transparently depending on load