.. _seed-0109:

===========================
0109: Communication Buffers
===========================
.. seed::
   :number: 109
   :name: Communication Buffers
   :status: Accepted
   :proposal_date: 2023-08-28
   :cl: 168357
   :authors: Taylor Cramer
   :facilitator: Erik Gilling

-------
Summary
-------
This SEED proposes that Pigweed adopt a standard buffer type for network-style
communications. This buffer type will be used in the new sockets API
(see the recently accepted `Communications SEED 107
<https://pigweed.dev/seed/0107-communications.html>`_ for more details), and
will be carried throughout the communications stack as appropriate.

------------------------
Top-Level Usage Examples
------------------------

Sending a Proto Into a Socket
=============================
.. code-block:: cpp

   Allocator& msg_allocator = socket.Allocator();
   size_t size = EncodedProtoSize(some_proto);
   std::optional<pw::MultiBuf> buffer = msg_allocator.Allocate(size);
   if (!buffer) { return; }
   EncodeProto(some_proto, *buffer);
   socket.Send(std::move(*buffer));

``Socket`` s provide an allocator which should be used to create a
``pw::MultiBuf``. The data to be sent is then written into this buffer before
it is passed to ``Send``. ``Socket`` s must accept ``pw::MultiBuf`` s which
were not created by their own ``Allocator``, but this may incur a performance
penalty.

Zero-Copy Fragmentation
=======================

The example above has a hidden feature: zero-copy fragmentation! The socket
can provide a ``pw::MultiBuf`` which is divided up into separate MTU-sized
chunks, each of which has reserved space for headers and/or footers:

.. code-block:: cpp

   std::optional<pw::MultiBuf> Allocate(size_t size) {
     if (size == 0) { return pw::MultiBuf(); }
     size_t data_per_chunk = mtu_size_ - header_size_;
     size_t num_chunks = 1 + ((size - 1) / data_per_chunk);
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::WithCapacity(num_chunks);
     if (!buffer) { return std::nullopt; }
     for (size_t i = 0; i < num_chunks; i++) {
       // Note: we could allocate smaller than `mtu_size_` for the final
       // chunk. This is omitted here for brevity.
       std::optional<pw::MultiBuf::Chunk> chunk = internal_alloc_.Allocate(mtu_size_);
       if (!chunk) { return std::nullopt; }
       // Reserve the header size by discarding the front of each buffer.
       chunk->DiscardFront(header_size_);
       buffer->Chunks()[i] = std::move(*chunk);
     }
     return buffer;
   }

This code reserves ``header_size_`` bytes at the beginning of each ``Chunk``.
When the caller writes into this memory and then passes it back to the socket,
these bytes can be claimed for the header using the ``ClaimPrefix`` function.

One use case that seems to demand the ability to fragment like this is breaking
up ``SOCK_SEQPACKET`` messages which (at least on Unix / Linux) may be much larger
than the MTU size: up to ``SO_SNDBUF`` (see `this man page
<https://man7.org/linux/man-pages/man7/socket.7.html>`_).
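
To illustrate how the reserved space might later be consumed, here is a minimal
sketch of the corresponding send path. ``WriteHeader`` and ``TransmitChunk``
are hypothetical stand-ins for the socket's link-layer logic, and error
handling is simplified:

.. code-block:: cpp

   pw::Status Send(pw::MultiBuf buffer) {
     for (auto& chunk : buffer.Chunks()) {
       // Re-expand each chunk into the space reserved via `DiscardFront` in
       // `Allocate` above, then write the link-layer header in place.
       if (!chunk.ClaimPrefix(header_size_)) {
         return pw::Status::ResourceExhausted();
       }
       WriteHeader(chunk.span().first(header_size_));
       TransmitChunk(chunk);
     }
     return pw::OkStatus();
   }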

Multiplexing Incoming Data
==========================

.. code-block:: cpp

   [[nodiscard]] bool SplitAndSend(pw::MultiBuf buffer) {
     std::optional<std::array<pw::MultiBuf, 2>> buffers =
         std::move(buffer).Split(split_index);
     if (!buffers) { return false; }
     socket_1_.Send(std::move(buffers->at(0)));
     socket_2_.Send(std::move(buffers->at(1)));
     return true;
   }

Incoming buffers can be split without copying, and the results can be forwarded
to multiple different ``Socket`` s, RPC services, or clients.

----------
Motivation
----------
Today, a Pigweed communications stack typically involves a number of different
buffers.

``pw_rpc`` users, for example, frequently use direct memory access (DMA) or
other system primitives to read data into a buffer, apply a link-layer protocol
such as HDLC (copying the data into a second buffer), and then pass the
resulting data into ``pw_rpc``, which parses it into its own buffer. Multiple
sets of buffers are present on the output side as well. Between DMA in and DMA
out, it's easy for data to pass through six or more different buffers.

These independent buffer systems introduce both time and space overhead. Aside
from the additional CPU time required to move the data around, users need to
ensure that all of the different buffer pools involved along the way have enough
space reserved to contain the entire message. Where caching is present, moving
the data between locations may introduce additional delay by thrashing between
memory regions.

--------
Proposal
--------
``pw::buffers::MultiBuf`` is a handle to a buffer optimized for use within
Pigweed's communications stack. It provides efficient, low-overhead buffer
management, and serves as a standard type for passing data between drivers,
TCP/IP implementations, RPC 2.0, and transfer 2.0.

A single ``MultiBuf`` is a list of ``Chunk`` s of data. Each ``Chunk``
points to an exclusively-owned portion of a reference-counted allocation.
``MultiBuf`` s can be easily split, joined, prefixed, postfixed, or infixed
without needing to copy the underlying data.

The memory pointed to by ``Chunk`` s is typically allocated from a pool
provided by a ``Socket``. This allows the ``Socket`` to provide backpressure to
callers, and to ensure that memory is placed in DMA or shared memory regions
as necessary.

In-Memory Layout
================

This diagram shows an example in-memory relationship between two buffers:

- ``Buffer1`` references one chunk from region A.
- ``Buffer2`` references two chunks, one each from regions A and B.

.. mermaid::

   graph TB;
      Buffer1 --> Chunk1A
      Chunk1A -- "[0..64]" --> MemoryRegionA(Memory Region A)
      Chunk1A --> ChunkRegionTrackerA
      Buffer2 --> Chunk2A & Chunk2B
      Chunk2A --> ChunkRegionTrackerA
      Chunk2A -- "[64..128]" --> MemoryRegionA(Memory Region A)
      Chunk2B -- "[0..128]" --> MemoryRegionB
      Chunk2B --> ChunkRegionTrackerB

API
===

The primary API is as follows:

.. code-block:: cpp

   /// An object that manages a single allocated region which is referenced
   /// by one or more chunks.
   class ChunkRegionTracker {
    public:
     ChunkRegionTracker(ByteSpan);

     /// Creates the first ``Chunk`` referencing a whole region of memory.
     /// This must only be called once per ``ChunkRegionTracker``.
     Chunk ChunkForRegion();

    protected:
     pw::Mutex& lock();

     /// Destroys the `ChunkRegionTracker`.
     ///
     /// Typical implementations will call `std::destroy_at(this)` and then
     /// free the memory associated with the tracker.
     virtual void Destroy();

     /// Defines the entire span of the region being managed. ``Chunk`` s
     /// referencing this tracker may not expand beyond this region
     /// (or into one another).
     ///
     /// This region must not change for the lifetime of this
     /// ``ChunkRegionTracker``.
     virtual ByteSpan region();

    private:
     /// Used to manage the internal linked-list of ``Chunk`` s that allows
     /// chunks to know whether or not they can expand to fill neighboring
     /// regions of memory.
     pw::Mutex lock_;
   };
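
   // Example only (not part of the proposed API): a minimal tracker stored
   // inline at the start of the allocation it manages, as discussed under
   // "Managing Allocations" below. `FreeRegionAllocation` is a hypothetical
   // helper that returns the whole allocation (tracker plus data) to the
   // system allocator.
   class HeaderChunkRegionTracker : public ChunkRegionTracker {
    public:
     explicit HeaderChunkRegionTracker(ByteSpan region)
         : ChunkRegionTracker(region), region_(region) {}

    protected:
     void Destroy() override {
       void* allocation = this;
       std::destroy_at(this);
       FreeRegionAllocation(allocation);
     }
     ByteSpan region() override { return region_; }

    private:
     ByteSpan region_;
   };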

   /// A handle to a contiguous refcounted slice of data.
   ///
   /// Note: this Chunk may acquire a ``pw::sync::Mutex`` during destruction,
   /// and so must not be destroyed within ISR context.
   class Chunk {
    public:
     Chunk();
     Chunk(ChunkRegionTracker&);
     Chunk(Chunk&& other) noexcept;
     Chunk& operator=(Chunk&& other);
     ~Chunk();
     std::byte* data();
     const std::byte* data() const;
     ByteSpan span();
     ConstByteSpan span() const;
     size_t size() const;

     std::byte& operator[](size_t index);
     std::byte operator[](size_t index) const;

     /// Decrements the reference count on the underlying chunk of data and
     /// empties this handle so that `span()` now returns an empty (zero-sized)
     /// span.
     ///
     /// Does not modify the underlying data, but may cause it to be
     /// released if this was the only remaining ``Chunk`` referring to it.
     /// This is equivalent to ``{ Chunk _unused = std::move(chunk_ref); }``.
     void Release();

     /// Attempts to add `prefix_bytes` bytes to the front of this buffer by
     /// advancing its range backwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes preceding it. For example, a `Chunk`
     /// which has been shrunk using `DiscardFront` can then be re-expanded
     /// using `ClaimPrefix`.
     ///
     /// Note that this will fail if a preceding `Chunk` has claimed this
     /// memory using `ClaimSuffix`.
     [[nodiscard]] bool ClaimPrefix(size_t prefix_bytes);

     /// Attempts to add `suffix_bytes` bytes to the back of this buffer by
     /// advancing its range forwards in memory. Returns `true` if the
     /// operation succeeded.
     ///
     /// This will only succeed if this `Chunk` points to a section of a chunk
     /// that has unreferenced bytes following it. For example, a `Chunk`
     /// which has been shrunk using `Truncate` can then be re-expanded using
     /// `ClaimSuffix`.
     ///
     /// Note that this will fail if a following `Chunk` has claimed this
     /// memory using `ClaimPrefix`.
     [[nodiscard]] bool ClaimSuffix(size_t suffix_bytes);

     /// Shrinks this handle to refer to the data beginning at offset
     /// ``new_start``.
     ///
     /// Does not modify the underlying data.
     void DiscardFront(size_t new_start);

     /// Shrinks this handle to refer to data in the range ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     void Slice(size_t begin, size_t end);

     /// Shrinks this handle to refer to only the first ``len`` bytes.
     ///
     /// Does not modify the underlying data.
     void Truncate(size_t len);

     /// Splits this chunk in two, with the second chunk starting at
     /// `split_point`.
     std::array<Chunk, 2> Split(size_t split_point) &&;
   };

   /// A handle to a sequence of potentially non-contiguous refcounted slices
   /// of data.
   class MultiBuf {
    public:
     struct Index {
       size_t chunk_index;
       size_t byte_index;
     };

     MultiBuf();

     /// Creates a ``MultiBuf`` pointing to a single, contiguous chunk of data.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     static std::optional<MultiBuf> FromChunk(Chunk chunk);

     /// Splits the ``MultiBuf`` into two separate buffers at ``split_point``.
     ///
     /// Returns ``std::nullopt`` if the ``ChunkList`` allocator is out of memory.
     std::optional<std::array<MultiBuf, 2>> Split(Index split_point) &&;
     std::optional<std::array<MultiBuf, 2>> Split(size_t split_point) &&;

     /// Appends the contents of ``suffix`` to this ``MultiBuf`` without
     /// copying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Append(MultiBuf suffix);

     /// Discards the first elements of this ``MultiBuf`` up to (but not
     /// including) ``new_start``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardFront(Index new_start);
     [[nodiscard]] bool DiscardFront(size_t new_start);

     /// Shifts and truncates this handle to refer to data in the range
     /// ``begin..<end``.
     ///
     /// Does not modify the underlying data.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Slice(size_t begin, size_t end);

     /// Discards the tail of this ``MultiBuf`` starting with
     /// ``first_index_to_drop``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(Index first_index_to_drop);

     /// Discards the tail of this ``MultiBuf`` so that only ``len`` elements
     /// remain.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool Truncate(size_t len);

     /// Discards the elements beginning with ``cut_start`` and resuming at
     /// ``resume_point``.
     ///
     /// Returns ``false`` if the ``ChunkList`` allocator is out of memory.
     [[nodiscard]] bool DiscardSegment(Index cut_start, Index resume_point);

     /// Returns an iterable over the ``Chunk`` s of memory within this
     /// ``MultiBuf``.
     auto Chunks();
     auto Chunks() const;

     /// Returns a ``BidirectionalIterator`` over the bytes in this ``MultiBuf``.
     ///
     /// Note that use of this iterator type may be less efficient than
     /// performing chunk-wise operations due to the noncontiguous nature of
     /// the iterator elements.
     auto begin();
     auto end();

     /// Counts the total number of ``Chunk`` s.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateNumChunks() const;

     /// Counts the total size in bytes of all ``Chunk`` s combined.
     ///
     /// This function is ``O(n)`` in the number of ``Chunk`` s.
     size_t CalculateTotalSize() const;

     /// Returns an ``Index`` which can be used to provide constant-time access
     /// to the element at ``position``.
     ///
     /// Any mutation of this ``MultiBuf`` (e.g. ``Append``, ``DiscardFront``,
     /// ``Slice``, or ``Truncate``) may invalidate this ``Index``.
     Index IndexAt(size_t position) const;
   };

   /// A future which resolves to a ``MultiBuf`` once one of the requested
   /// size can be allocated.
   class MultiBufAllocationFuture {
    public:
     Poll<std::optional<MultiBuf>> Poll(Context&);
   };

   /// A future which resolves to a single contiguous ``MultiBuf::Chunk`` once
   /// one of the requested size can be allocated.
   class MultiBufChunkAllocationFuture {
    public:
     Poll<std::optional<MultiBuf::Chunk>> Poll(Context&);
   };

   class MultiBufAllocator {
    public:
     std::optional<MultiBuf> Allocate(size_t size);
     std::optional<MultiBuf> Allocate(size_t min_size, size_t desired_size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t size);
     std::optional<MultiBuf::Chunk> AllocateContiguous(size_t min_size, size_t desired_size);

     MultiBufAllocationFuture AllocateAsync(size_t size);
     MultiBufAllocationFuture AllocateAsync(size_t min_size, size_t desired_size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t size);
     MultiBufChunkAllocationFuture AllocateContiguousAsync(size_t min_size, size_t desired_size);

    protected:
     virtual std::optional<MultiBuf> DoAllocate(size_t min_size, size_t desired_size);
     virtual std::optional<MultiBuf::Chunk> DoAllocateContiguous(size_t min_size, size_t desired_size);

     /// Invoked by the ``MultiBufAllocator`` to signal waiting futures that
     /// buffers of the provided sizes may be available for allocation.
     void AllocationAvailable(size_t size_available, size_t contiguous_size_available);
   };
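
As a brief illustration of how these pieces compose, the sketch below walks a
received frame through a few of the operations above. ``kHeaderSize``,
``kTrailerSize``, and the incoming ``Chunk`` are hypothetical, and size
validation is omitted:

.. code-block:: cpp

   std::optional<pw::MultiBuf> ExtractPayload(pw::MultiBuf::Chunk frame) {
     // Wrap the chunk received from the driver; no bytes are copied.
     std::optional<pw::MultiBuf> buffer = pw::MultiBuf::FromChunk(std::move(frame));
     if (!buffer) { return std::nullopt; }
     // Drop the link-layer header without copying the payload.
     if (!buffer->DiscardFront(kHeaderSize)) { return std::nullopt; }
     // Separate the trailing checksum from the payload, again without copying.
     size_t payload_size = buffer->CalculateTotalSize() - kTrailerSize;
     std::optional<std::array<pw::MultiBuf, 2>> parts =
         std::move(*buffer).Split(payload_size);
     if (!parts) { return std::nullopt; }
     return std::move(parts->at(0));
   }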

The ``Chunk`` s in the prototype are stored in fallible dynamically-allocated
arrays, but they could be stored in vectors of a fixed maximum size. The
``Chunk`` s cannot be stored as an intrusively-linked list because this would
require per-``Chunk`` overhead in the underlying buffer, and there may be any
number of ``Chunk`` s within the same buffer.

Multiple ``Chunk`` s may not refer to the same memory, but they may refer to
non-overlapping sections of memory within the same region. When one ``Chunk``
within a region is deallocated, a neighboring chunk may expand to use its space.

--------------------
Vectorized Structure
--------------------
The most significant design choice made in this API is supporting vectorized
I/O via a list of ``Chunk`` s. While this does carry an additional overhead, it
is strongly motivated by the desire to carry data throughout the communications
stack without needing a copy. By carrying a list of ``Chunk`` s, ``MultiBuf``
allows data to be prefixed, suffixed, infixed, or split without incurring the
overhead of a separate allocation and copy.

--------------------------------------------------------------------------
Managing Allocations with ``MultiBufAllocator`` and ``ChunkRegionTracker``
--------------------------------------------------------------------------
``pw::MultiBuf`` is not associated with a concrete allocator implementation.
Instead, it delegates the creation of buffers to implementations of
the ``MultiBufAllocator`` base class. This allows different allocator
implementations to vend out ``pw::MultiBuf`` s that are optimized for use with a
particular communications stack.

For example, a communications stack which runs off of shared memory or specific
DMA'able regions might choose to allocate memory out of those regions to allow
for zero-copy writes.

Additionally, custom allocator implementations can reserve headers, footers, or
even split a ``pw::MultiBuf`` into multiple chunks in order to provide zero-copy
fragmentation.
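
To make this concrete, here is a minimal sketch of one such allocator.
``dma_heap_`` and ``NewRegionTracker`` are hypothetical helpers standing in for
platform-specific region allocation and tracker placement; a real
implementation may differ:

.. code-block:: cpp

   class DmaMultiBufAllocator : public MultiBufAllocator {
    protected:
     std::optional<MultiBuf> DoAllocate(size_t min_size,
                                        size_t desired_size) override {
       // Prefer the desired size, but fall back to the caller's minimum.
       ByteSpan region = dma_heap_.Allocate(desired_size);
       if (region.empty()) { region = dma_heap_.Allocate(min_size); }
       if (region.empty()) { return std::nullopt; }
       // Create a tracker for the new region (e.g. stored as a header within
       // it); the region is released once the last `Chunk` referencing it has
       // been destroyed.
       ChunkRegionTracker& tracker = NewRegionTracker(region);
       return MultiBuf::FromChunk(tracker.ChunkForRegion());
     }

     // `DoAllocateContiguous` would be implemented similarly, returning the
     // region's single `Chunk` directly.
   };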

Deallocation of these regions is managed through the ``ChunkRegionTracker``.
When no more ``Chunk`` s associated with a ``ChunkRegionTracker`` exist,
the ``ChunkRegionTracker`` receives a ``Destroy`` call to release both the
region and the ``ChunkRegionTracker``.

The ``MultiBufAllocator`` can place the ``ChunkRegionTracker`` wherever it
wishes, including as a header or footer for the underlying region allocation.
This is not required, however, as memory in regions like shared or DMA'able
memory might be limited, in which case the ``ChunkRegionTracker`` can be
stored elsewhere.

-----------------------------------------------------
Compatibility With Existing Communications Interfaces
-----------------------------------------------------

lwIP
====
The `Lightweight IP stack (lwIP) <https://www.nongnu.org/lwip>`_ is a TCP/IP
implementation designed for use on small embedded systems. Some Pigweed
platforms may choose to use lwIP as the backend for their ``Socket``
implementations, so it's important that ``pw::MultiBuf`` interoperate
efficiently with lwIP's native buffer type.

lwIP has its own buffer type, `pbuf
<https://www.nongnu.org/lwip/2_1_x/structpbuf.html>`_, which is optimized for
use with `zero-copy applications
<https://www.nongnu.org/lwip/2_1_x/zerocopyrx.html>`_.
``pbuf`` represents buffers as linked lists of reference-counted ``pbuf`` s,
each of which has a pointer to its payload, a length, and some metadata. While
this does not precisely match the representation of ``pw::MultiBuf``, it is
possible to construct a ``pbuf`` list which points at the various chunks of a
``pw::MultiBuf`` without needing to perform a copy of the data.

Similarly, a ``pw::MultiBuf`` can refer to a ``pbuf`` by using each ``pbuf`` as
a "``Chunk`` region", holding a reference count on the region's ``pbuf`` so
long as any ``Chunk`` continues to reference the data referenced by that
buffer.
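
As a rough sketch of the first direction, under the zero-copy model linked
above, each ``Chunk`` of a ``pw::MultiBuf`` could be wrapped in a custom
``pbuf`` whose free callback releases the ``Chunk``. The ``ChunkPbuf`` wrapper
and its lifetime management are hypothetical glue code, not part of this
proposal or of lwIP:

.. code-block:: cpp

   // Assumes lwIP is built with LWIP_SUPPORT_CUSTOM_PBUF enabled.
   struct ChunkPbuf {
     pbuf_custom custom;         // Must be first so the pbuf* maps back to us.
     pw::MultiBuf::Chunk chunk;  // Keeps the underlying region alive.
   };

   void FreeChunkPbuf(pbuf* p) { delete reinterpret_cast<ChunkPbuf*>(p); }

   pbuf* WrapChunk(pw::MultiBuf::Chunk chunk) {
     auto* wrapper = new ChunkPbuf{{}, std::move(chunk)};
     wrapper->custom.custom_free_function = FreeChunkPbuf;
     return pbuf_alloced_custom(
         PBUF_RAW, static_cast<u16_t>(wrapper->chunk.size()), PBUF_REF,
         &wrapper->custom, wrapper->chunk.data(),
         static_cast<u16_t>(wrapper->chunk.size()));
   }

Per-chunk ``pbuf`` s produced this way could then be chained together with
``pbuf_cat`` to mirror the structure of the original ``pw::MultiBuf``.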

------------------
Existing Solutions
------------------

Linux's ``sk_buff``
===================
Linux has a similar buffer structure called `sk_buff
<https://docs.kernel.org/networking/skbuff.html#struct-sk-buff>`_.
It differs in a few significant ways:

It provides separate ``head``, ``data``, ``tail``, and ``end`` pointers, and
additional scatter-gather fragments are supplied via the ``frags[]`` array.

Separately, it has a ``frag_list`` of IP fragments which is created via calls
to ``ip_push_pending_frames``. Fragments may therefore be supplied both by the
``frags[]`` payload and by the ``skb_shared_info.frag_list`` pointing to the
tail.

``sk_buff`` reference-counts not only the underlying chunks of memory, but also
the ``sk_buff`` structure itself. This allows for clones of an ``sk_buff``
without manipulating the reference counts of the individual chunks. We
anticipate that cloning a whole ``pw::buffers::MultiBuf`` will be uncommon
enough that it is better to keep these structures single-owner (and mutable)
rather than sharing and reference-counting them.

FreeBSD's ``mbuf``
==================
FreeBSD uses a design called `mbuf
<https://man.freebsd.org/cgi/man.cgi?query=mbuf>`_ which notably allows data
within the middle of a buffer to be given a specified alignment, so that data
can later be prepended within the same buffer. This is similar to the structure
of ``Chunk``, which may reference data in the middle of an allocated buffer,
allowing prepending without a copy.