==========
ublk-qcow2
==========

Motivation
==========

ublk-qcow2 was started to serve four purposes:

- building one complicated target from scratch helps the libublksrv
  APIs/functions become mature and stable more quickly, since qcow2 is
  complicated and demands more from libublksrv than the other simple
  targets (loop, null)

- there have been several attempts to implement a qcow2 driver in the
  kernel, such as ``qloop`` [#qloop]_, ``dm-qcow2`` [#dm_qcow2]_ and
  ``in kernel qcow2(ro)`` [#in_kernel_qcow2_ro]_, so ublk-qcow2 might be
  useful for covering requirements in this field

- performance comparison with qemu-nbd; writing one ublk-qcow2 to evaluate
  the performance of the ublk/io_uring backend was my first thought since
  ublksrv was started

- help to abstract common building blocks and design patterns for writing
  new ublk targets/backends

Howto
=====

::

    ublk add -t qcow2 -f $PATH_QCOW2_IMG

So far no command line options have been added yet. The default L2 cache
size is 1MB, and the default refcount cache size is 256KB. Both the l2 and
refcount slice sizes are 4K. With DEBUG_QCOW2_META_STRESS enabled, only two
l2 slices and two refcount slices are allowed, and ublk-qcow2 has been
verified with this minimum cache size setting.
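
For example, a device can be created, inspected and removed as below;
``ublk list`` and ``ublk del`` are companion commands of the same ``ublk``
tool, and the image path here is hypothetical::

    ublk add -t qcow2 -f test.qcow2   # creates /dev/ublkbN for the image
    ublk list                         # show state of ublk devices
    ublk del -n 0                     # remove the device, assuming its id is 0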

Design
======

Based on ublk framework
-----------------------

ublk-qcow2 is built on libublksrv and the common target code.

IO size
-------

To simplify the handling of cluster mapping, the chunk_sectors of the block
layer queue limit is aligned with QCOW2's cluster size. This guarantees that
at most one l2 lookup is needed for handling one ublk-qcow2 IO, and that a
single backing IO is enough to handle it. But this may hurt big-chunk
sequential IO a bit. In the future, chunk_sectors may be increased to 512KB;
then loading an L2 slice at most once is still enough for handling one ublk
IO, but such a big IO needs to be split into at most 512K/cluster_size
smaller IOs.
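
The resulting limits can be computed directly from the cluster size; a small
worked example, assuming the qcow2 default of 64KB clusters::

    #include <cstdio>

    int main(void)
    {
        unsigned cluster_bits = 16;                 // 1 << 16 = 64KB clusters
        unsigned cluster_size = 1U << cluster_bits;

        // chunk_sectors aligned with the cluster size (512-byte sectors),
        // so one ublk IO never spans two clusters
        unsigned chunk_sectors = cluster_size >> 9; // 128 for 64KB clusters

        // with a future 512KB chunk_sectors, one big IO would be split
        // into at most 512K/cluster_size cluster-sized IOs
        unsigned max_split = (512U << 10) / cluster_size;

        printf("cluster %u bytes, chunk_sectors %u, max split %u\n",
               cluster_size, chunk_sectors, max_split);
        return 0;
    }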

Async io
--------

The target/backend is implemented with io_uring only, and the same io_uring
is shared for handling both ublk io commands and qcow2 IOs.

Any IO from the ublk driver has one unique tag, and any meta IO is assigned
one tag by ublk-qcow2 too. Each IO (including meta IO) is handled in one
coroutine context, so a coroutine is always bound to one unique IO tag. IO
is always submitted via io_uring in async style, then the coroutine is
suspended after the submission. Once the IO is completed, the coroutine is
resumed for further processing, as sketched below.
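
A minimal, self-contained C++20 sketch of this suspend-on-submit /
resume-on-completion pattern; the per-tag registry below is hypothetical,
and the real target uses libublksrv's co_wait()/co_resume() helpers
instead::

    #include <coroutine>
    #include <unordered_map>

    struct task {
        struct promise_type {
            task get_return_object() { return {}; }
            std::suspend_never initial_suspend() noexcept { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
    };

    std::unordered_map<int, std::coroutine_handle<>> suspended; // tag -> handle

    struct io_await {
        int tag;
        bool await_ready() const noexcept { return false; }
        void await_suspend(std::coroutine_handle<> h) const
        {
            // the SQE for this tag would be submitted here,
            // then the coroutine is parked until its CQE arrives
            suspended[tag] = h;
        }
        void await_resume() const noexcept {}
    };

    task handle_io(int tag)
    {
        co_await io_await{tag};  // suspended until the io for this tag is done
        // ... continue processing the completed IO ...
    }

    void on_cqe(int tag)         // called from the io_uring completion loop
    {
        auto it = suspended.find(tag);
        if (it != suspended.end()) {
            auto h = it->second;
            suspended.erase(it);
            h.resume();          // resume the coroutine bound to this tag
        }
    }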

Metadata update
---------------

The soft update approach is taken for maintaining qcow2 metadata integrity
in the event of a crash or power outage.

All metadata is updated asynchronously, with the following ordering rules
(see the sketch after this list):

- meta entry dependency on cluster

  When one entry of the l1 table/refcount table/l2 table/refcount blk table
  needs to be updated and the pointed cluster has to be allocated first, the
  entry is updated only after the allocated cluster is discarded/zeroed, so
  any following read on this mapping will get correct data. During this
  period, any read on any sector in this cluster returns zero, and any write
  IO won't be started until the entry is updated. So the cluster
  discard/zeroing is always done before updating the meta entry pointing to
  this cluster and before writing io data to any sector in this cluster.

- io data writing depends on zeroed cluster

  If the cluster isn't zeroed, the io write has to wait until the zeroing is
  done; the io read has to return zero while the cluster is being zeroed

- an L2/refcount blk entry can be written back iff the pointed cluster is
  zeroed

  Meantime the cluster holding the table itself needs to be zeroed too

- L1 entry depends on l2 table (cache slice)

  A dirty L1 entry can only be updated iff the pointed l2 table becomes
  clean, that means: 1) the pointed cluster needs to be zeroed; 2) all
  dirty slices need to be updated

- refcount table entry depends on refcount blk

  A dirty refcount table entry can only be updated iff the pointed refcount
  blk becomes clean, that means: 1) the pointed cluster needs to be zeroed;
  2) all dirty slices need to be updated
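
The ordering boils down to a small state check per cluster; a hedged sketch
with hypothetical names follows (the real code tracks this via cluster state
objects, see the later sections)::

    // Hypothetical cluster states, ordered by progress.
    enum cluster_state {
        CLUSTER_ALLOCATED,   // space reserved, contents still undefined
        CLUSTER_ZEROED,      // discard/zeroing done, reads return zero
        CLUSTER_MAPPED,      // the meta entry pointing at it is updated
    };

    // The l2/refcount entry may only be written back after zeroing, so a
    // crash can never expose stale data through a valid mapping.
    bool may_update_meta_entry(cluster_state s)
    {
        return s >= CLUSTER_ZEROED;
    }

    // io data writes must wait for the zeroing too; reads return zero
    // until the mapping entry is updated.
    bool may_write_io_data(cluster_state s)
    {
        return s >= CLUSTER_ZEROED;
    }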

Meta data flushing to image
---------------------------

When any meta (L1/L2/refcount table/refcount blk) is being flushed to the
image, the IO code path can't update the in-ram meta data until the flushing
is done and the dirty flag is cleared.

Any meta is always flushed in the background:

- when cache slices are added to the dirty list, flushing of these slices is
  started after all current IOs are handled

- meta data is flushed when io_uring is idle

- meta data is flushed periodically

How to flush meta data
~~~~~~~~~~~~~~~~~~~~~~~~

1) allocate one tag for flushing one meta chain; soft update has to be
   respected, starting from the lowest cluster zeroing IO up to updating
   the l1 or refcount table at the top

2) from the implementation viewpoint, find the meta flush chains from top
   to bottom (see the sketch after this list):

   - find the oldest dirty entry in the top meta (l1 or refcount table), or
     use a specified index (when flushing from the slice dirty list);
     suppose the index is A, then figure out all dirty entries in the
     512-byte range which includes index A

   - for each dirty entry among the candidates:

     - for each dirty slice in the cluster pointed to by the dirty entry,
       check whether every cluster pointed to by the slice is zeroed; if
       any is not, wait until all of them are zeroed

     - figure out the pointed cluster; if that cluster isn't zeroed yet,
       zero it now

     - flush all dirty slices in this cluster

   - flush all meta entries in this 512-byte area
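
A rough sketch of this walk, using simplified stand-in types rather than the
real qcow2 classes (the waiting/zeroing steps are synchronous here only for
brevity; in the real code they suspend the flushing coroutine)::

    #include <vector>

    struct Slice {
        bool dirty = false;
        bool clusters_zeroed = false;  // all clusters pointed to by the slice
    };

    struct Entry {
        bool dirty = false;
        bool cluster_zeroed = false;   // cluster holding the pointed table
        std::vector<Slice> slices;
    };

    void wait_for_zeroing(Slice &s) { s.clusters_zeroed = true; }
    void zero_cluster(Entry &e)     { e.cluster_zeroed = true; }
    void flush_slice(Slice &s)      { s.dirty = false; }
    void flush_meta_area(std::vector<Entry> &es)
    {
        for (auto &e : es)
            e.dirty = false;           // write back the 512-byte area
    }

    // walk one flush chain over the dirty entries of one 512-byte meta area
    void flush_chain(std::vector<Entry> &candidates)
    {
        for (Entry &e : candidates) {
            if (!e.dirty)
                continue;
            // soft update: data clusters must be zeroed before their slices
            for (Slice &s : e.slices)
                if (s.dirty && !s.clusters_zeroed)
                    wait_for_zeroing(s);
            // the cluster holding the table itself must be zeroed too
            if (!e.cluster_zeroed)
                zero_cluster(e);
            for (Slice &s : e.slices)
                if (s.dirty)
                    flush_slice(s);
        }
        flush_meta_area(candidates);   // finally the top-level entries
    }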

How to retrieve meta object after the meta io is done
-----------------------------------------------------

- use add_meta_io/del_meta_io/get_meta_io for meta flushing
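
These helpers plausibly amount to a tag-indexed lookup table, so the
completion path can map an io tag back to its meta object; a hedged sketch
with hypothetical signatures::

    #include <unordered_map>

    struct Qcow2Meta;   // stand-in for the real meta types

    std::unordered_map<unsigned, Qcow2Meta *> meta_io_map;  // tag -> meta

    void add_meta_io(unsigned tag, Qcow2Meta *m) { meta_io_map[tag] = m; }
    void del_meta_io(unsigned tag)               { meta_io_map.erase(tag); }

    Qcow2Meta *get_meta_io(unsigned tag)
    {
        auto it = meta_io_map.find(tag);
        return it == meta_io_map.end() ? nullptr : it->second;
    }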

L2/refcount blk slice lifetime
------------------------------

- the meta slice idea is from QEMU: both the l2 and refcount block tables
  take one cluster each, but the slice size is configurable, and by default
  both the l2 and refcount block slices are 4K. So when one l2 mapping or
  one refcount block meta is needed, just the 4K part is loaded from the
  image, and when flushing a slice to the image, it is still the whole
  slice that is written out.

- for each kind of slice, one lru cache is maintained; a new slice is added
  to the lru cache, and if it is accessed less, the slice moves towards the
  end of the lru cache. The lru cache capacity is fixed when ublk-qcow2 is
  started, but it is configurable; the default size is 1MB, so one lru
  cache may hold at most 256 l2 or refcount block slices. Eventually a
  slice may be evicted from the lru cache.

- two reference counts are grabbed in slice_cache<T>::alloc_slice(), so
  alloc_slice() always returns one valid slice object, but the object may
  not be in the lru list, because it can be evicted by a nested
  alloc_slice() if the lru capacity runs out (see the sketch after this
  list). Note, ->wakeup_all() could trigger another alloc_slice().

- when one slice is evicted from the lru cache, one reference is dropped.
  If the slice is clean, it is added to the per-device free list, which is
  iterated over for releasing slices once the current IO batch is handled.
  If the slice is dirty, adding it to the free list is delayed until the
  flushing of this slice completes.

- when one slice is evicted from the lru cache, it is moved to the evicted
  slices map, and the slice is still visible via find_slice(slice key,
  true), but it becomes read-only after being evicted from the lru cache.

- one slice is visible via find_slice() from allocation to freeing, and the
  slice becomes invisible when it is destructed, see
  Qcow2L2Table::~Qcow2L2Table() and Qcow2RefcountBlock::~Qcow2RefcountBlock()
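
A hedged illustration of why two references are grabbed, with hypothetical
simplified types (the real code uses the in-tree lrucache helper)::

    struct Slice { int refcnt; bool read_only; };

    Slice *alloc_slice()
    {
        Slice *s = new Slice{/*refcnt=*/2, /*read_only=*/false};
        // one reference for the lru cache, one for the caller, so the
        // returned object stays valid even if a nested alloc_slice()
        // evicts it from the lru cache right away
        /* ... insert into the lru cache, possibly evicting another ... */
        return s;
    }

    void evict(Slice *s)
    {
        s->read_only = true;   // evicted slices become read-only
        if (--s->refcnt == 0)
            delete s;          // otherwise released later via the free list
    }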

Cluster state object lifetime
-----------------------------

The cluster state object tracks whether one cluster has been zeroed, and may
be freed any time after its state becomes QCOW2_ALLOC_ZEROED.

Tracking dirty index
--------------------

For both l2 slices and refcount blk slices, the minimum flushing unit is a
single slice, so we don't track the exact dirty index for these two.

For the l1 table and the refcount table, the minimum flushing unit is 512
bytes or the logical block size, so we just track which 512-byte unit is
dirty, as sketched below.
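
A minimal sketch of such 512-byte-granularity tracking, assuming the table
is small enough for one 64-bit mask (names hypothetical)::

    #include <cstdint>

    struct DirtyTracker {
        uint64_t dirty_mask = 0;   // bit N covers 512-byte unit N

        void mark_dirty(unsigned byte_off)
        {
            dirty_mask |= 1ULL << (byte_off / 512);
        }
        bool unit_dirty(unsigned unit) const
        {
            return dirty_mask & (1ULL << unit);
        }
        void clear_unit(unsigned unit)
        {
            dirty_mask &= ~(1ULL << unit);
        }
    };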

IOWaiter
--------

- a slice can't be written while it is being loaded from the image or being
  stored to the image
- after one slice is evicted from the lru cache, it becomes read-only
  automatically, but any in-progress load/flush is guaranteed to complete
- ``class IOWaiter`` is invented for handling all kinds of wait/wakeup
  (a sketch follows), and could become part of libublksrv in the future
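
A hedged mock of the wait/wakeup idea; the real class lives in the
ublk-qcow2 sources, and resume_fn stands in for resuming the coroutine
bound to an io tag::

    #include <vector>

    struct IOWaiter {
        std::vector<unsigned> waiters;        // suspended io tags

        void add_waiter(unsigned tag)
        {
            waiters.push_back(tag);           // remember who to wake up
        }

        template <typename ResumeFn>
        void wakeup_all(ResumeFn resume_fn)   // called on the state change
        {
            std::vector<unsigned> tags;
            tags.swap(waiters);               // allow nested add_waiter()
            for (unsigned tag : tags)
                resume_fn(tag);               // resume each waiting context
        }
    };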

Implementation
==============

C++
---

ublk-qcow2 is basically implemented in C++ and does not depend on any 3rd
party library, except for the in-tree lrucache helper and the nlohmann json
lib (only for setting up the target); it is built almost completely on the
C++ standard library. The most frequently used component is C++'s unordered
map, which is used for building the l2/refcount blk slice lru caches.

C++20 is needed just for the coroutine feature, but the usage is simple
(only co_wait() and co_resume() are used), and it could be replaced with
another coroutine implementation if C++20 is a blocker.

Coroutine with exception & IO tag
---------------------------------

The IO tag is 1:1 with the coroutine context: the IO is submitted to
io_uring and finally completed in this coroutine context. While waiting for
the io completion, the coroutine is suspended, and once the io is done by
io_uring, the coroutine is resumed so IO handling can move on.

Anywhere that depends on one event, which is usually modeled as one state
change, the context represented by the io tag is added via
io_waiter.add_waiter(), then one io exception is thrown, and the exception
is caught so the current coroutine can be suspended. Once the state changes
to the expected value, the waiters are woken up via io_waiter.wakeup_all(),
and the coroutine contexts waiting for the state change are resumed, as
sketched below.
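
A self-contained sketch of the consuming side of this pattern, with
simplified stand-ins for the real io_waiter and suspend helpers::

    #include <exception>
    #include <set>

    struct MetaIoException : std::exception {};

    std::set<unsigned> waiters;

    void add_waiter(unsigned tag)
    {
        waiters.insert(tag);      // remember which tag to wake up...
        throw MetaIoException();  // ...then unwind out of the io logic
    }

    void suspend_current_coroutine(unsigned /*tag*/) { /* park until resumed */ }

    void io_step(unsigned tag)
    {
        try {
            add_waiter(tag);      // the needed state isn't ready yet
        } catch (MetaIoException &) {
            // wakeup_all() on the state change resumes this tag's
            // coroutine and the step is retried
            suspend_current_coroutine(tag);
        }
    }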

C++20 coroutines are stackless and very efficient, but hard to use, and
nested coroutines aren't supported, so programming with C++20 coroutines is
not very easy; this area should be improved in the future.

References
==========

.. [#qloop] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
.. [#dm_qcow2] https://lwn.net/Articles/889429/
.. [#in_kernel_qcow2_ro] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository