==========
ublk-qcow2
==========

Motivation
==========

ublk-qcow2 was started to serve four purposes:

- building one complicated target from scratch helps libublksrv APIs/functions
  become mature/stable more quickly, since qcow2 is complicated and demands
  more from libublksrv than the simple targets (loop, null)

- there have been several attempts to implement a qcow2 driver in the kernel,
  such as ``qloop`` [#qloop]_, ``dm-qcow2`` [#dm_qcow2]_ and
  ``in kernel qcow2(ro)`` [#in_kernel_qcow2_ro]_, so ublk-qcow2 might be
  useful for covering the requirements in this field

- comparing performance with qemu-nbd; evaluating the performance of the
  ublk/io_uring backend by writing one ublk-qcow2 has been my first thought
  since ublksrv was started

- helping to abstract common building blocks and design patterns for writing
  new ublk targets/backends

Howto
=====

::

    ublk add -t qcow2 -f $PATH_QCOW2_IMG

No command line option has been added yet. The default L2 cache size is 1MB,
and the default refcount cache size is 256KB. Both the L2 and refcount slice
sizes are 4K. With DEBUG_QCOW2_META_STRESS enabled, only two L2 slices and
two refcount slices are allowed, and ublk-qcow2 is verified with this minimum
cache size setting.


Design
======

Based on ublk framework
-----------------------

ublk-qcow2 is based on libublksrv and the common target code.

IO size
-------

To simplify the handling of cluster mapping, the chunk_sectors of the block
layer queue limit is aligned with QCOW2's cluster size. This guarantees that
at most one L2 lookup is needed for handling one ublk-qcow2 IO, and that one
backing-image IO is enough to handle one ublk-qcow2 IO. But it may hurt
big-chunk sequential IO a bit. In the future, chunk_sectors may be increased
to 512KB; then loading the L2 slice at most once is still enough for handling
one ublk IO, but such a big IO needs to be split into at most
512K/cluster_size small IOs.
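
The arithmetic can be illustrated with a small standalone sketch (hypothetical
numbers, assuming a 64KB cluster size; not code from ublk-qcow2)::

    #include <cstdio>

    int main()
    {
        unsigned int cluster_size = 64 << 10;           /* 64KB qcow2 cluster */
        unsigned int chunk_sectors = cluster_size >> 9; /* 128 sectors: one ublk
                                                           IO never crosses a
                                                           cluster boundary */
        unsigned int big_chunk = 512 << 10;             /* possible future limit */

        /* with a 512KB chunk, the L2 slice is still loaded at most once,
           but the big IO is split into 512K/cluster_size small IOs */
        printf("chunk_sectors=%u, max split IOs=%u\n",
               chunk_sectors, big_chunk / cluster_size);
        return 0;
    }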


Async io
--------

The target/backend is implemented with io_uring only, and shares the same
io_uring for handling both ublk io commands and qcow2 IOs.

Any IO from the ublk driver has one unique tag, and any meta IO is assigned
one tag by ublk-qcow2 too. Each IO (including meta IO) is handled in one
coroutine context, so a coroutine is always bound to one unique IO tag. IO is
always submitted via io_uring in async style, then the coroutine is suspended
after the submission. Once the IO is completed, the coroutine is resumed for
further processing.
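
This suspend/resume flow can be sketched with a minimal C++20 coroutine; the
types below (io_task, wait_cqe, the suspended map) are illustrative only and
differ from the real libublksrv/ublk-qcow2 code::

    #include <coroutine>
    #include <unordered_map>

    struct io_task {
        struct promise_type {
            io_task get_return_object() { return {}; }
            std::suspend_never initial_suspend() { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
    };

    /* hypothetical tag -> suspended coroutine table */
    std::unordered_map<unsigned, std::coroutine_handle<>> suspended;

    struct wait_cqe {
        unsigned tag;
        bool await_ready() { return false; }
        void await_suspend(std::coroutine_handle<> h) { suspended[tag] = h; }
        void await_resume() {}
    };

    io_task handle_io(unsigned tag)
    {
        /* submit_sqe(tag, ...): queue the sqe on the shared io_uring */
        co_await wait_cqe{tag};   /* suspend until the cqe arrives */
        /* resumed here once io_uring completes the IO */
    }

    /* called from the cqe reaping loop */
    void on_cqe(unsigned tag)
    {
        auto it = suspended.find(tag);
        if (it != suspended.end()) {
            auto h = it->second;
            suspended.erase(it);
            h.resume();           /* resume the per-tag coroutine */
        }
    }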

Metadata update
---------------

A soft update approach is taken for maintaining qcow2 metadata integrity in
the event of a crash or power outage.

All metadata is updated asynchronously, under the following rules (see the
sketch after this list):

- meta entry dependency on cluster

  When one entry of the L1 table, refcount table, L2 table or refcount blk
  table needs to be updated: if the pointed cluster needs to be allocated,
  the entry is updated only after the allocated cluster is discarded/zeroed,
  so any following read on this mapping will get correct data. During this
  period, any read on any sector in this cluster returns zero, and any write
  IO won't be started until the entry is updated. So cluster
  discarding/zeroing is always done before updating the meta entry pointing
  to this cluster and before writing io data to any sector in this cluster.

- io data writing depends on the zeroed cluster

  If the cluster isn't zeroed, the io write has to wait until the zeroing is
  done; the io read has to return zero while the cluster is being zeroed.

- an L2/refcount blk entry can be written back iff the pointed cluster is
  zeroed

  Meantime the cluster holding the table needs to be zeroed too.

- an L1 entry depends on the L2 table (cache slice)

  The L1 dirty entry can only be updated iff the pointed L2 table becomes
  clean, which means: 1) the pointed cluster has been zeroed; 2) all dirty
  slices have been updated.

- a refcount table entry depends on the refcount blk

  The refcount table dirty entry can only be updated iff the pointed refcount
  blk becomes clean, which means: 1) the pointed cluster has been zeroed; 2)
  all dirty slices have been updated.
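
The ordering can be summarized in a short hedged sketch; none of these names
exist in ublk-qcow2, they only illustrate the soft-update rules above::

    #include <cstdint>

    enum cluster_state { ALLOCATING, ZEROING, ZEROED };

    struct cluster {
        uint64_t host_off;          /* offset of the cluster in the image */
        cluster_state state = ALLOCATING;
    };

    /* stub: the real target submits an async discard/zero via io_uring
       and suspends the calling coroutine until it completes */
    static void zero_cluster_async(cluster &c) { (void)c; }

    static void map_cluster(cluster &c /* , Qcow2L2Table &l2, ... */)
    {
        /* 1) zero the cluster first; while state == ZEROING, reads in
           this cluster return zero and writes are parked */
        c.state = ZEROING;
        zero_cluster_async(c);
        c.state = ZEROED;

        /* 2) only now may the meta entry be updated to point at c, so
           a following read through this mapping can't see stale data */

        /* 3) only after the entry update may parked write IOs start */
    }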


Metadata flushing to image
--------------------------

When any meta (L1 table, L2 table, refcount table or refcount blk) is being
flushed to the image, the IO code path can't update the in-ram metadata until
the flushing is completed, at which point the dirty flag is cleared.

Any meta is always flushed in the background:

- when cache slices are added to the dirty list, flushing of these cache
  slices is started after all current IOs are handled

- metadata is flushed when io_uring is idle

- metadata is flushed periodically

How to flush metadata
~~~~~~~~~~~~~~~~~~~~~

1) allocate one tag for flushing one meta chain; the soft update order has to
   be respected, starting from the lowest cluster zeroing IO up to updating
   the L1 or refcount table at the top

2) from the implementation viewpoint, find the meta flush chains from top to
   bottom (see the sketch after this list)

   - find the oldest dirty entry in the top meta (L1 or refcount table), or
     use the specified index (when flushing from the slice dirty list);
     suppose the index is A, then figure out all dirty entries in the
     512-byte range which includes index A

   - for each dirty entry in the candidates:

     - for each dirty slice in the cluster pointed to by this dirty entry,
       check if any cluster pointed to by the slice is still being zeroed;
       if there is any, wait until all those clusters are zeroed

     - figure out the pointed cluster; if the cluster isn't zeroed yet,
       zero it now

     - flush all dirty slices in this cluster

   - flush all meta entries in this 512-byte area
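
A hedged sketch of this top-down walk follows; all types and helpers are
hypothetical stubs, not the actual ublk-qcow2 API::

    #include <vector>

    struct slice { bool dirty; };
    struct entry {
        bool cluster_zeroed;
        std::vector<slice *> slices;    /* slices in the pointed cluster */
    };
    struct top_table {
        /* dirty entries inside the 512-byte unit containing idx */
        std::vector<entry *> dirty_entries_in_512b(unsigned) { return {}; }
        void flush_512b_unit(unsigned) {}   /* persist the 512-byte area */
    };

    static void wait_pointed_clusters_zeroed(slice &) {} /* may suspend */
    static void zero_cluster(entry &e) { e.cluster_zeroed = true; }
    static void flush_slice(slice &s) { s.dirty = false; }

    static void flush_meta_chain(top_table &top, unsigned idx)
    {
        for (entry *e : top.dirty_entries_in_512b(idx)) {
            /* soft update: clusters pointed at by each dirty slice must
               be zeroed before the slice itself can be written back */
            for (slice *s : e->slices)
                if (s->dirty)
                    wait_pointed_clusters_zeroed(*s);

            if (!e->cluster_zeroed)
                zero_cluster(*e);   /* zero the cluster holding the table */

            for (slice *s : e->slices)
                if (s->dirty)
                    flush_slice(*s);
        }
        top.flush_512b_unit(idx);   /* finally flush the top-level entries */
    }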

How to retrieve the meta object after the meta io is done
---------------------------------------------------------

- use add_meta_io/del_meta_io/get_meta_io for meta flushing
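
A plausible shape for this mapping is sketched below; the real
add_meta_io/del_meta_io/get_meta_io signatures in ublk-qcow2 may differ::

    #include <unordered_map>

    struct Qcow2Meta;   /* stands in for the real meta class */

    struct meta_io_map {
        std::unordered_map<unsigned, Qcow2Meta *> tag2meta;

        void add_meta_io(unsigned tag, Qcow2Meta *m) { tag2meta[tag] = m; }
        void del_meta_io(unsigned tag) { tag2meta.erase(tag); }

        /* called after the meta io completes, to find which meta object
           the cqe with this tag belongs to */
        Qcow2Meta *get_meta_io(unsigned tag) {
            auto it = tag2meta.find(tag);
            return it == tag2meta.end() ? nullptr : it->second;
        }
    };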


L2/refcount blk slice lifetime
------------------------------

- the meta slice idea is from QEMU: each of the L2 table and the refcount
  block table takes one cluster, but the slice size is configurable, and by
  default both the L2 and refcount block slice size is 4K. So when one L2
  mapping or one refcount block meta is needed, just the 4K part is loaded
  from the image; when flushing a slice to the image, it is still the whole
  slice that is flushed out.

- for each kind of slice, one lru cache is maintained; a new slice is added
  to the lru cache, and if it is less accessed, the slice is moved towards
  the end of the lru cache. The lru cache capacity is fixed when starting
  ublk-qcow2, but it is configurable; the default size is 1MB, so one lru
  cache may hold at most 256 L2 or refcount block slices. Eventually a slice
  may be evicted from the lru cache.

- two reference counts are grabbed in slice_cache<T>::alloc_slice(), so
  alloc_slice() always returns one valid slice object, but it may not be in
  the lru list, because it can be evicted in a nested alloc_slice() if the
  lru capacity runs out. Note, ->wakeup_all() could trigger another
  alloc_slice(). See the sketch after this list.

- when one slice is evicted from the lru cache, one reference is dropped. If
  the slice is clean, it is added to the per-device free list, which is
  iterated over for releasing slices when the current IO batch has been
  handled. If the slice is dirty, adding it to the free list is delayed
  until the flushing of this slice is completed.

- when one slice is evicted from the lru cache, it is moved to the evicted
  slices map, and the slice is still visible via find_slice(slice key, true),
  but it becomes read-only after being evicted from the lru cache.

- one slice is visible via find_slice() from allocation to freeing, and the
  slice becomes invisible when it is destructed, see
  Qcow2L2Table::~Qcow2L2Table() and Qcow2RefcountBlock::~Qcow2RefcountBlock()
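
The reference rules can be sketched with a toy lru cache; the real
slice_cache<T> differs (for instance, freeing dirty victims is delayed until
they are flushed), and all names below are illustrative::

    #include <cstdint>
    #include <list>
    #include <memory>
    #include <unordered_map>

    struct slice { uint64_t key; bool dirty = false; };

    class slice_lru {
        unsigned cap;
        std::list<std::shared_ptr<slice>> lru;  /* front is hottest */
        std::unordered_map<uint64_t,
            std::list<std::shared_ptr<slice>>::iterator> map;
    public:
        explicit slice_lru(unsigned capacity) : cap(capacity) {}

        /* the returned shared_ptr is the caller's reference, the one in
           the lru list is the cache's; eviction only drops the latter,
           so the returned slice object stays valid */
        std::shared_ptr<slice> alloc_slice(uint64_t key) {
            if (lru.size() >= cap) {
                auto victim = lru.back();   /* least recently used */
                map.erase(victim->key);
                lru.pop_back();             /* victim may live on while the
                                               caller still references it */
            }
            auto s = std::make_shared<slice>(slice{key});
            lru.push_front(s);
            map[key] = lru.begin();
            return s;
        }

        std::shared_ptr<slice> find_slice(uint64_t key) {
            auto it = map.find(key);
            if (it == map.end())
                return nullptr;
            lru.splice(lru.begin(), lru, it->second);  /* mark as hot */
            return *it->second;
        }
    };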

Cluster state object lifetime
-----------------------------

The cluster state object tracks whether one cluster is zeroed, and it can be
freed anytime after its state becomes QCOW2_ALLOC_ZEROED.

Tracking dirty index
--------------------

For both L2 slices and refcount blk slices, the minimum flushing unit is a
single slice, so we don't track the exact dirty index for these two.

For the L1 table and the refcount table, the minimum flushing unit is 512
bytes (or the logical block size), so we just track which 512-byte unit is
dirty.
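
One bit per 512-byte unit is enough for this; a minimal hypothetical sketch::

    #include <bitset>
    #include <cstdint>

    struct top_table_dirty {
        static constexpr unsigned unit = 512;  /* minimum flushing unit */
        std::bitset<64> dirty;                 /* enough for this sketch */

        void mark_dirty(uint64_t byte_off) { dirty.set(byte_off / unit); }
        bool unit_dirty(unsigned i) const  { return dirty.test(i); }
        void clear_unit(unsigned i)        { dirty.reset(i); }
    };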

IOWaiter
--------

- one slice can't be written while it is being loaded from the image or
  being stored to the image
- after one slice is evicted from the lru cache, it becomes read-only
  automatically, but any in-progress load/flush is guaranteed to be completed
- ``class IOWaiter`` is invented for handling all kinds of wait/wakeup, and
  it could become part of libublksrv in the future; see the sketch after
  this list
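
A rough shape of the wait/wakeup pattern is sketched below; the real
``class IOWaiter`` interface may differ::

    #include <functional>
    #include <vector>

    class IOWaiter {
        std::vector<unsigned> waiters;  /* tags of suspended coroutines */
    public:
        void add_waiter(unsigned tag) { waiters.push_back(tag); }

        /* resume every waiting coroutine once the state change happened */
        void wakeup_all(const std::function<void(unsigned)> &resume_tag) {
            auto w = std::move(waiters);  /* take the list first, since
                                             resuming may add new waiters */
            waiters.clear();
            for (unsigned tag : w)
                resume_tag(tag);
        }
    };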


Implementation
==============

C++
---

ublk-qcow2 is basically implemented in C++ and does not depend on any 3rd
party library, except for the in-tree lrucache helper and the nlohmann json
lib (only used for setting up the target); it is built on the C++ standard
library almost completely. The most frequently used component is C++'s
unordered map, which is used for building the L2/refcount blk slice lru
caches.

C++20 is needed just for the coroutine feature, but the usage (only
co_wait() and co_resume() are used) is simple, and it could be replaced with
another coroutine implementation if C++20 is a blocker.


Coroutine with exception & IO tag
---------------------------------

An IO tag is 1:1 with a coroutine context, in which the IO is submitted to
io_uring and finally completed. When waiting for io completion, the
coroutine is suspended, and once the io is done by io_uring, the coroutine
is resumed, then IO handling can move on.

Anywhere the code depends on one event, which is usually modeled as one
state change, the context represented by the io tag is added via
io_waiter.add_waiter(), then one io exception is thrown; the exception is
caught and the current coroutine is suspended. Once the state changes to the
expected value, the waiters are woken up via io_waiter.wakeup_all(), and the
coroutine context waiting for the state change is resumed, as in the sketch
below.
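
The throw/suspend pattern can be sketched as follows, reusing the
illustrative io_task, wait_cqe and IOWaiter types from the sketches above;
all names are hypothetical::

    struct MetaIoException {};

    /* register the waiter and throw when the state isn't ready yet */
    void check_ready(IOWaiter &w, unsigned tag, bool ready)
    {
        if (!ready) {
            w.add_waiter(tag);      /* remembered for wakeup_all() */
            throw MetaIoException{};
        }
    }

    io_task handle_io_with_wait(IOWaiter &w, unsigned tag, bool ready)
    {
        for (;;) {
            bool need_wait = false;
            try {
                check_ready(w, tag, ready);  /* e.g. "slice is clean" */
                /* ... continue IO handling once the state is right */
                co_return;
            } catch (MetaIoException &) {
                /* co_await isn't allowed inside a handler, so only
                   record that this coroutine has to suspend */
                need_wait = true;
            }
            if (need_wait)
                co_await wait_cqe{tag};  /* resumed via wakeup_all(),
                                            which resumes this tag */
        }
    }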

C++20 coroutines are stackless and very efficient, but hard to use, and
nested coroutines aren't supported, so programming with C++20 coroutines is
not very easy; this area should be improved in the future.

References
==========

.. [#qloop] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
.. [#dm_qcow2] https://lwn.net/Articles/889429/
.. [#in_kernel_qcow2_ro] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository
264