.. _seed-0109:

===========================
0109: Communication Buffers
===========================
.. seed::
   :number: 109
   :name: Communication Buffers
   :status: Accepted
   :proposal_date: 2023-08-28
   :cl: 168357
   :authors: Taylor Cramer
   :facilitator: Erik Gilling

-------
Summary
-------
This SEED proposes that Pigweed adopt a standard buffer type for network-style
communications. This buffer type will be used in the new sockets API
(see the recently-accepted `Communications SEED 107
<https://pigweed.dev/seed/0107-communications.html>`_ for more details), and
will be carried throughout the communications stack as appropriate.

------------------------
Top-Level Usage Examples
------------------------

Sending a Proto Into a Socket
=============================
.. code-block:: cpp

   Allocator& msg_allocator = socket.Allocator();
   size_t size = EncodedProtoSize(some_proto);
   std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size);
   if (!buffer) { return; }
   EncodeProto(some_proto, *buffer);
   socket.Send(std::move(*buffer));

``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is then written into this buffer before
being sent. ``Socket`` s must accept ``pw::MultiBuf`` s which were not created
by their own ``Allocator``, but this may incur a performance penalty.

Zero-Copy Fragmentation
=======================

The example above has a hidden feature: zero-copy fragmentation! The socket
can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized
chunks, each of which has reserved space for headers and/or footers:

.. code-block:: cpp

   std::optional<pw::MultiBuf> Allocate(size_t size) {
     if (size == 0) { return pw::MultiBuf(); }
     size_t data_per_chunk = mtu_size_ - header_size_;
     size_t num_chunks = 1 + ((size - 1) / data_per_chunk);
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks);
     if (!buffer) { return std::nullopt; }
     for (size_t i = 0; i < num_chunks; i++) {
       // Note: we could allocate smaller than `mtu_size_` for the final
       // chunk. This is omitted here for brevity.
       std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_);
       if (!chunk) { return std::nullopt; }
       // Reserve the header size by discarding the front of each buffer.
       chunk->DiscardFront(header_size_);
       buffer->Chunks()[i] = std::move(*chunk);
     }
     return std::move(*buffer);
   }

This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``.
When the caller writes into this memory and then passes it back to the socket,
these bytes can be claimed for the header using the ``ClaimPrefix`` function.
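
For illustration, here is a minimal sketch (not part of the proposal) of how a
socket might claim those reserved bytes when the buffer is passed back to it.
``ExampleSocket``, ``WriteHeader``, and ``header_size_`` are assumed names
standing in for the socket's real header logic:

.. code-block:: cpp

   void ExampleSocket::Send(pw::MultiBuf buffer) {
     for (auto& chunk : buffer.Chunks()) {
       // Expand the chunk backwards into the space reserved by `Allocate`.
       if (!chunk.ClaimPrefix(header_size_)) {
         // The buffer came from elsewhere and has no reserved header space; a
         // real implementation might allocate a separate header chunk instead.
         continue;
       }
       // Hypothetical helper that writes the link-layer header into the first
       // `header_size_` bytes of the chunk.
       WriteHeader(chunk.span().first(header_size_));
     }
     // ... enqueue `buffer` for transmission ...
   }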

One use case that seems to demand the ability to fragment like this is breaking
up ``SOCK_SEQPACKET`` messages, which (at least on Unix / Linux) may be much
larger than the MTU size: up to ``SO_SNDBUF`` (see `this man page
<https://man7.org/linux/man-pages/man7/socket.7.html>`_).

Multiplexing Incoming Data
==========================

.. code-block:: cpp

   [[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer, size_t split_index) {
     std::optional<std::array<pw::MultiBuf, 2>> buffers =
       std::move(buffer).Split(split_index);
     if (!buffers) { return false; }
     socket_1_.Send(std::move(buffers->at(0)));
     socket_2_.Send(std::move(buffers->at(1)));
     return true;
   }

Incoming buffers can be split without copying, and the results can be forwarded
to multiple different ``Socket`` s, RPC services or clients.

----------
Motivation
----------
Today, a Pigweed communications stack typically involves a number of different
buffers.

``pw_rpc`` users, for example, frequently use direct-memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer protocol
such as HDLC (which copies the data into a second buffer), and then pass the
resulting data into ``pw_rpc``, which parses it into its own buffer. Multiple
sets of buffers are present on the output side as well. Between DMA in and DMA
out, it's easy for data to pass through six or more different buffers.

These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that all of the different buffer pools involved along the way have enough
space reserved to contain the entire message. Where caching is present, moving
the memory between locations may create an additional delay by thrashing
between memory regions.

--------
Proposal
--------
``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within
Pigweed's communications stack. It provides efficient, low-overhead buffer
management, and serves as a standard type for passing data between drivers,
TCP/IP implementations, RPC 2.0, and transfer 2.0.

A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk``
points to an exclusively-owned portion of a reference-counted allocation.
``MultiBuf`` s can be easily split, joined, prefixed, postfixed, or infixed
without needing to copy the underlying data.
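
As a rough sketch of what this enables (the 64-byte offset and the ``buffer``
variable are illustrative assumptions), a buffer can be split at an arbitrary
byte offset and later re-joined without touching the underlying data:

.. code-block:: cpp

   // Split `buffer` in two at byte offset 64. No payload bytes are copied;
   // only the chunk handles are rearranged.
   std::optional<std::array<pw::MultiBuf, 2>> halves = std::move(buffer).Split(64);
   if (!halves) { return; }

   // Re-join the two halves. Again, only chunk handles move; the underlying
   // reference-counted regions are untouched.
   pw::MultiBuf rejoined = std::move((*halves)[0]);
   if (!rejoined.Append(std::move((*halves)[1]))) { return; }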

The memory pointed to by ``Chunk`` s is typically allocated from a pool
provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to
callers, and to ensure that memory is placed in DMA or shared memory regions
as necessary.

In-Memory Layout
================

This diagram shows an example in-memory relationship between two buffers:

- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one from region A and one from region B.

.. mermaid::

   graph TB;
   Buffer1 --> Chunk1A
   Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A)
   Chunk1A --> ChunkRegionTrackerA
   Buffer2 --> Chunk2A & Chunk2B
   Chunk2A --> ChunkRegionTrackerA
   Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A)
   Chunk2B -- "[0..128]" --> MemoryRegionB
   Chunk2B --> ChunkRegionTrackerB

API
===

The primary API is as follows:

.. code-block:: cpp

   /// An object that manages a single allocated region which is referenced
   /// by one or more chunks.
   class ChunkRegionTracker {
    public:
     ChunkRegionTracker(ByteSpan);

     /// Creates the first ``Chunk`` referencing a whole region of memory.
     /// This must only be called once per ``ChunkRegionTracker``.
     Chunk ChunkForRegion();

    protected:
     pw::Mutex& lock();

     /// Destroys the `ChunkRegionTracker`.
     ///
     /// Typical implementations will call `std::destroy_at(this)` and then
     /// free the memory associated with the tracker.
     virtual void Destroy();

     /// Defines the entire span of the region being managed. ``Chunk`` s
     /// referencing this tracker may not expand beyond this region
     /// (or into one another).
     ///
     /// This region must not change for the lifetime of this
     /// ``ChunkRegionTracker``.
     virtual ByteSpan region();

    private:
     /// Used to manage the internal linked-list of ``Chunk`` s that allows
     /// chunks to know whether or not they can expand to fill neighboring
     /// regions of memory.
     pw::Mutex lock_;
   };

   /// A handle to a contiguous refcounted slice of data.
   ///
   /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction, and
   /// so must not be destroyed within ISR context.
   class Chunk {
    public:
     Chunk();
     Chunk(ChunkRegionTracker&);
     Chunk(Chunk&& other) noexcept;
     Chunk& operator=(Chunk&& other);
     ~Chunk();
     std::byte* data();
     const std::byte* data() const;
     ByteSpan span();
     ConstByteSpan span() const;
     size_t size() const;

     std::byte& operator[](size_t index);
     std::byte operator[](size_t index) const;

     /// Decrements the reference count on the underlying chunk of data and empties
     /// this handle so that `span()` now returns an empty (zero-sized) span.
     ///
     /// Does not modify the underlying data, but may cause it to be
     /// released if this was the only remaining ``Chunk`` referring to it.
     /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }``
     void Release();

     /// Attempts to add `prefix_bytes` bytes to the front of this buffer by
     /// advancing its range backwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes preceding it. For example, a `Chunk`
     /// which has been shrunk using `DiscardFront` can then be re-expanded using
     /// `ClaimPrefix`.
     ///
     /// Note that this will fail if a preceding `Chunk` has claimed this
     /// memory using `ClaimSuffix`.
     [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes);

     /// Attempts to add `suffix_bytes` bytes to the back of this buffer by
     /// advancing its range forwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes following it. For example, a `Chunk`
     /// which has been shrunk using `Truncate` can then be re-expanded using
     /// `ClaimSuffix`.
     ///
     /// Note that this will fail if a following `Chunk` has claimed this
     /// memory using `ClaimPrefix`.
     [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes);

     /// Shrinks this handle to refer to the data beginning at offset ``new_start``.
     ///
     /// Does not modify the underlying data.
     void DiscardFront(size_t new_start);

     /// Shrinks this handle to refer to data in the range ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     void Slice(size_t begin, size_t end);

     /// Shrinks this handle to refer to only the first ``len`` bytes.
     ///
     /// Does not modify the underlying data.
     void Truncate(size_t len);

     /// Splits this chunk in two, with the second chunk starting at `split_point`.
     std::array<Chunk, 2> Split(size_t split_point) &&;
   };

   /// A handle to a sequence of potentially non-contiguous refcounted slices of
   /// data.
   class MultiBuf {
    public:
     struct Index {
       size_t chunk_index;
       size_t byte_index;
     };

     MultiBuf();

     /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     static std::optional<MultiBuf> FromChunk(Chunk chunk);

     /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&;
     std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&;

     /// Appends the contents of ``suffix`` to this ``MultiBuf`` without copying data.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Append(MultiBuf suffix);

     /// Discards the first elements of ``MultiBuf`` up to (but not including)
     /// ``new_start``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardFront(Index new_start);
     [[nodiscard]] bool DiscardFront(size_t new_start);

     /// Shifts and truncates this handle to refer to data in the range
     /// ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Slice(size_t begin, size_t end);

     /// Discards the tail of this ``MultiBuf`` starting with ``first_index_to_drop``.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(Index first_index_to_drop);
     /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements remain.
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(size_t len);

     /// Discards the elements beginning with ``cut_start`` and resuming at
     /// ``resume_point``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point);

     /// Returns an iterable over the ``Chunk``s of memory within this ``MultiBuf``.
     auto Chunks();
     auto Chunks() const;

     /// Returns a ``BidirectionalIterator`` over the bytes in this ``MultiBuf``.
     ///
     /// Note that use of this iterator type may be less efficient than
     /// performing chunk-wise operations due to the noncontiguous nature of
     /// the iterator elements.
     auto begin();
     auto end();

     /// Counts the total number of ``Chunk`` s.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateNumChunks() const;

     /// Counts the total size in bytes of all ``Chunk`` s combined.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateTotalSize() const;

     /// Returns an ``Index`` which can be used to provide constant-time access to
     /// the element at ``position``.
     ///
     /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``,
     /// ``Slice``, or ``Truncate``) may invalidate this ``Index``.
     Index IndexAt(size_t position) const;
   };


   class MultiBufAllocationFuture {
    public:
     Poll<std::optional<MultiBuf>> Poll(Context&);
   };

   class MultiBufChunkAllocationFuture {
    public:
     Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
   };

   class MultiBufAllocator {
    public:
     std::optional<MultiBuf> Allocate(size_t size);
     std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size);

     MultiBufAllocationFuture AllocateAsync(size_t size);
     MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size);

    protected:
     virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size);
     virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size);

     /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that buffers of
     /// the provided sizes may be available for allocation.
     void AllocationAvailable(size_t size_available, size_t contiguous_size_available);
   };

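As a usage sketch only (the region trackers, sizes, and error handling are
illustrative assumptions rather than part of the proposal), the two-buffer
layout from the diagram above could be assembled from these primitives roughly
as follows:

.. code-block:: cpp

   // Region A: 128 bytes shared by Buffer1 and Buffer2.
   Chunk chunk_a = region_tracker_a.ChunkForRegion();
   std::array<Chunk, 2> split_a = std::move(chunk_a).Split(64);

   // Buffer1 references bytes [0..64] of region A.
   std::optional<MultiBuf> buffer1 = MultiBuf::FromChunk(std::move(split_a[0]));

   // Buffer2 references bytes [64..128] of region A followed by region B.
   std::optional<MultiBuf> buffer2 = MultiBuf::FromChunk(std::move(split_a[1]));
   std::optional<MultiBuf> region_b =
       MultiBuf::FromChunk(region_tracker_b.ChunkForRegion());
   if (!buffer1 || !buffer2 || !region_b ||
       !buffer2->Append(std::move(*region_b))) {
     return;  // The ``ChunkList`` allocator ran out of memory.
   }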

The ``Chunk`` s in the prototype are stored in fallible dynamically-allocated
arrays, but they could be stored in vectors of a fixed maximum size. The ``Chunk`` s
cannot be stored as an intrusively-linked list because this would require per-``Chunk``
overhead in the underlying buffer, and there may be any number of ``Chunk`` s within
the same buffer.

Multiple ``Chunk`` s may not refer to the same memory, but they may refer to
non-overlapping sections of memory within the same region. When one ``Chunk``
within a region is deallocated, a neighboring chunk may expand to use its space.
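
For example, the following sketch (assuming ``chunk`` spans a whole 128-byte
region) shows one chunk reclaiming its neighbor's bytes after that neighbor is
released:

.. code-block:: cpp

   // Split a chunk covering a full region into two 64-byte neighbors.
   std::array<Chunk, 2> pieces = std::move(chunk).Split(64);

   // While `pieces[1]` exists, its bytes are referenced, so the first chunk
   // cannot grow forwards into them.
   PW_ASSERT(!pieces[0].ClaimSuffix(64));

   // Once the second chunk is released, its bytes become unreferenced and the
   // first chunk may expand to cover them again.
   pieces[1].Release();
   PW_ASSERT(pieces[0].ClaimSuffix(64));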

--------------------
Vectorized Structure
--------------------
The most significant design choice made in this API is supporting vectorized
IO via a list of ``Chunk`` s. While this does carry additional overhead, it
is strongly motivated by the desire to carry data throughout the communications
stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf``
allows data to be prefixed, suffixed, infixed, or split without incurring the
overhead of a separate allocation and copy.
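
For instance, a protocol layer can prepend its header by building a
``MultiBuf`` from a header ``Chunk`` and appending the payload, rather than
copying the payload into a larger buffer. A sketch, assuming ``header_chunk``
and ``payload`` already exist:

.. code-block:: cpp

   std::optional<MultiBuf> packet = MultiBuf::FromChunk(std::move(header_chunk));
   if (!packet || !packet->Append(std::move(payload))) {
     return;  // Only the ``ChunkList`` bookkeeping allocation can fail here.
   }
   // `packet` now refers to the header chunk followed by every chunk of
   // `payload`; no payload bytes were copied to build the new buffer.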

---------------------------------------------------------------------------
Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker``
---------------------------------------------------------------------------
``pw::MultiBuf`` is not associated with a concrete allocator implementation.
Instead, it delegates the creation of buffers to implementations of
the ``MultiBufAllocator`` base class. This allows different allocator
implementations to vend out ``pw::MultiBuf`` s that are optimized for use with a
particular communications stack.

For example, a communications stack which runs off of shared memory or specific
DMA'able regions might choose to allocate memory out of those regions to allow
for zero-copy writes.

Additionally, custom allocator implementations can reserve headers, footers, or
even split a ``pw::MultiBuf`` into multiple chunks in order to provide zero-copy
fragmentation.
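
As an illustration only, a hypothetical allocator backed by a DMA-capable pool
might look roughly like the following. ``DmaRegionPool``, ``dma_pool_``, and
``DmaChunkRegionTracker`` are assumed names for the low-level pool and its
tracker, not part of the proposal:

.. code-block:: cpp

   class DmaMultiBufAllocator : public MultiBufAllocator {
    protected:
     std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size) override {
       // A single contiguous chunk satisfies the request here; a fancier
       // implementation could stitch together several smaller chunks instead.
       std::optional<MultiBuf::Chunk> chunk = DoAllocateContiguous(min_size, desired_size);
       if (!chunk) { return std::nullopt; }
       return MultiBuf::FromChunk(std::move(*chunk));
     }

     std::optional<MultiBuf::Chunk> DoAllocateContiguous(
         size_t min_size, size_t desired_size) override {
       // `dma_pool_` hands out regions placed in DMA-capable memory. The
       // tracker is a ChunkRegionTracker subclass that returns its region to
       // the pool from `Destroy()`.
       DmaChunkRegionTracker* tracker = dma_pool_.AllocateRegion(min_size, desired_size);
       if (tracker == nullptr) { return std::nullopt; }
       return tracker->ChunkForRegion();
     }

    private:
     DmaRegionPool dma_pool_;
   };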

Deallocation of these regions is managed through the ``ChunkRegionTracker``.
When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist,
the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the
region and the ``ChunkRegionTracker``.

The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it
wishes, including as a header or footer for the underlying region allocation.
This is not required, however, as memory in regions like shared or DMA'able
memory might be limited, in which case the ``ChunkRegionTracker`` can be
stored elsewhere.
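
One possible arrangement (a sketch, not a requirement of the proposal; the
``RawAllocate`` and ``RawRelease`` helpers are assumed stand-ins for whatever
low-level allocator an implementation uses) places the tracker at the front of
the block it allocates, so a single allocation backs both the tracker and the
region it manages:

.. code-block:: cpp

   class HeaderRegionTracker : public ChunkRegionTracker {
    public:
     static HeaderRegionTracker* New(size_t region_size) {
       // One block holds the tracker followed by the region it manages.
       // (Alignment of the region is glossed over here.)
       std::byte* block = RawAllocate(sizeof(HeaderRegionTracker) + region_size);
       if (block == nullptr) { return nullptr; }
       return new (block) HeaderRegionTracker(
           ByteSpan(block + sizeof(HeaderRegionTracker), region_size));
     }

    protected:
     ByteSpan region() override { return region_; }

     void Destroy() override {
       std::byte* block = reinterpret_cast<std::byte*>(this);
       std::destroy_at(this);
       // The tracker header and its region are one block, so one release
       // frees them both.
       RawRelease(block);
     }

    private:
     explicit HeaderRegionTracker(ByteSpan region)
         : ChunkRegionTracker(region), region_(region) {}

     ByteSpan region_;
   };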

------------------------------------------------------
Compatibility With Existing Communications Interfaces
------------------------------------------------------

lwIP
====
`Lightweight IP stack (lwIP)
<https://www.nongnu.org/lwip>`_
is a TCP/IP implementation designed for use on small embedded systems. Some
Pigweed platforms may choose to use lwIP as the backend for their ``Socket``
implementations, so it's important that ``pw::MultiBuf`` interoperate efficiently
with lwIP's native buffer type.

lwIP has its own buffer type, `pbuf
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_ optimized for use with
`zero-copy applications
<https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_.
``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s,
each of which has a pointer to its payload, a length, and some metadata. While
this does not precisely match the representation of ``pw::MultiBuf``, it is
possible to construct a ``pbuf`` list which points at the various chunks of a
``pw::MultiBuf`` without needing to perform a copy of the data.
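
A sketch of that direction, assuming lwIP's ``PBUF_REF`` pbufs (which point at
externally owned payloads) and a ``multibuf`` that outlives the resulting
chain:

.. code-block:: cpp

   struct pbuf* MultiBufToPbufChain(pw::MultiBuf& multibuf) {
     struct pbuf* head = nullptr;
     for (auto& chunk : multibuf.Chunks()) {
       // A PBUF_REF pbuf does not copy the payload; it simply points at it.
       struct pbuf* p =
           pbuf_alloc(PBUF_RAW, static_cast<u16_t>(chunk.size()), PBUF_REF);
       if (p == nullptr) {
         if (head != nullptr) { pbuf_free(head); }
         return nullptr;
       }
       p->payload = chunk.data();
       if (head == nullptr) {
         head = p;
       } else {
         pbuf_cat(head, p);  // Appends `p` to the chain and takes ownership.
       }
     }
     return head;
   }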

Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as
a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so
long as any ``Chunk`` continues to reference the data referenced by that
buffer.
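
The reverse direction might look like the following sketch, where a
hypothetical ``PbufRegionTracker`` keeps the underlying ``pbuf`` alive for as
long as any ``Chunk`` references its region:

.. code-block:: cpp

   class PbufRegionTracker : public ChunkRegionTracker {
    public:
     explicit PbufRegionTracker(struct pbuf* p)
         : ChunkRegionTracker(ByteSpan(static_cast<std::byte*>(p->payload), p->len)),
           pbuf_(p) {
       pbuf_ref(p);  // Hold a reference for the lifetime of this tracker.
     }

    protected:
     ByteSpan region() override {
       return ByteSpan(static_cast<std::byte*>(pbuf_->payload), pbuf_->len);
     }

     void Destroy() override {
       struct pbuf* p = pbuf_;
       std::destroy_at(this);
       // Drop the reference now that no Chunk refers to this region. How the
       // tracker's own storage is released is left to the allocator.
       pbuf_free(p);
     }

    private:
     struct pbuf* pbuf_;
   };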

------------------
Existing Solutions
------------------

Linux's ``sk_buff``
===================
Linux has a similar buffer structure called `sk_buff
<https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_.
It differs in a few significant ways:

It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers.
Other scatter-gather fragments are supplied using the ``frags[]`` structure.

Separately, it has a ``frag_list`` of IP fragments which is created via calls to
``ip_push_pending_frames``. Fragments are supplied by the ``frags[]`` payload in
addition to the ``skb_shared_info.frag_list`` pointing to the tail.

``sk_buff`` reference-counts not only the underlying chunks of memory, but also the
``sk_buff`` structure itself. This allows for clones of ``sk_buff`` without
manipulating the reference counts of the individual chunks. We anticipate that
cloning a whole ``pw::buffers::MultiBuf`` will be uncommon enough that it is
better to keep these structures single-owner (and mutable) rather than sharing
and reference-counting them.

FreeBSD's ``mbuf``
==================
FreeBSD uses a design called `mbuf
<https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_
which interestingly allows data within the middle of a buffer to be given a
specified alignment, allowing data to be prepended within the same buffer.
This is similar to the structure of ``Chunk``, which may reference data in the
middle of an allocated buffer, allowing prepending without a copy.
493