.. SPDX-License-Identifier: GPL-2.0

=================================
NVMe PCI Endpoint Function Target
=================================

:Author: Damien Le Moal <[email protected]>

The NVMe PCI endpoint function target driver implements a NVMe PCIe controller
using a NVMe fabrics target controller configured with the PCI transport type.

Overview
========

The NVMe PCI endpoint function target driver allows exposing a NVMe target
controller over a PCIe link, thus implementing an NVMe PCIe device similar to a
regular M.2 SSD. The target controller is created in the same manner as when
using NVMe over fabrics: the controller represents the interface to an NVMe
subsystem using a port. The port transport type must be configured to be
"pci". The subsystem can be configured to have namespaces backed by regular
files or block devices, or can use NVMe passthrough to expose to the PCI host an
existing physical NVMe device or a NVMe fabrics host controller (e.g. a NVMe TCP
host controller).

The NVMe PCI endpoint function target driver relies as much as possible on the
NVMe target core code to parse and execute NVMe commands submitted by the PCIe
host. However, using the PCI endpoint framework API and DMA API, the driver is
also responsible for managing all data transfers over the PCIe link. This
implies that the NVMe PCI endpoint function target driver implements the
management of several NVMe data structures as well as some NVMe command
parsing.

1) The driver manages retrieval of NVMe commands in submission queues using DMA
   if supported, or MMIO otherwise. Each command retrieved is then executed
   using a work item to maximize performance with the parallel execution of
   multiple commands on different CPUs. The driver uses a work item to
   constantly poll the doorbell of all submission queues to detect command
   submissions from the PCIe host.

2) The driver transfers completion queue entries of completed commands to the
   PCIe host using an MMIO copy of the entries into the host completion queue.
   After posting completion entries in a completion queue, the driver uses the
   PCI endpoint framework API to raise an interrupt to the host to signal the
   completion of the commands.

3) For any command that has a data buffer, the NVMe PCI endpoint target driver
   parses the command PRP or SGL lists to create a list of PCI address
   segments representing the mapping of the command data buffer on the host.
   The command data buffer is transferred over the PCIe link using this list of
   PCI address segments using DMA, if supported. If DMA is not supported, MMIO
   is used, which results in poor performance. For write commands, the command
   data buffer is transferred from the host into a local memory buffer before
   executing the command using the target core code. For read commands, a local
   memory buffer is allocated to execute the command and the content of that
   buffer is transferred to the host once the command completes.

Controller Capabilities
-----------------------

The NVMe capabilities exposed to the PCIe host through the BAR 0 registers
are almost identical to the capabilities of the NVMe target controller
implemented by the target core code. There are some exceptions.

1) The NVMe PCI endpoint target driver always sets the controller capability
   CQR bit to request "Contiguous Queues Required". This is to facilitate the
   mapping of a queue PCI address range to the local CPU address space.

2) The doorbell stride (DSTRD) is always set to 4B.

3) Since the PCI endpoint framework does not provide a way to handle PCI level
   resets, the controller capability NSSR bit (NVM Subsystem Reset Supported)
   is always cleared.

4) The boot partition support (BPS), Persistent Memory Region Supported (PMRS)
   and Controller Memory Buffer Supported (CMBS) capabilities are never
   reported.

Supported Features
------------------

The NVMe PCI endpoint target driver implements support for both PRPs and SGLs.
The driver also implements IRQ vector coalescing and submission queue
arbitration burst.

The maximum number of queues and the maximum data transfer size (MDTS) are
configurable through configfs before starting the controller. To avoid issues
with excessive local memory usage for executing commands, MDTS defaults to
512 KB and is limited to a maximum of 2 MB (arbitrary limit).
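
For example, assuming the endpoint function directory nvmepf.0 created as shown
in the User Guide below, MDTS could be raised to 1 MB before starting the
controller by writing the function's *mdts_kb* attribute (documented in the
Endpoint Bindings section)::

   # echo 1024 > /sys/kernel/config/pci_ep/functions/nvmet_pci_epf/nvmepf.0/nvme/mdts_kb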

Minimum Number of PCI Address Mapping Windows Required
-------------------------------------------------------

Most PCI endpoint controllers provide a limited number of mapping windows for
mapping a PCI address range to local CPU memory addresses. The NVMe PCI
endpoint target controller uses mapping windows for the following.

1) One memory window for raising MSI or MSI-X interrupts
2) One memory window for MMIO transfers
3) One memory window for each completion queue

Given the highly asynchronous nature of the NVMe PCI endpoint target driver
operation, the memory windows as described above will generally not be used
simultaneously, but that may happen. So a safe maximum number of completion
queues that can be supported is equal to the total number of memory mapping
windows of the PCI endpoint controller minus two. E.g. for an endpoint PCI
controller with 32 outbound memory windows available, up to 30 completion
queues can be safely operated without any risk of getting PCI address mapping
errors due to the lack of memory windows.

Maximum Number of Queue Pairs
-----------------------------

Upon binding of the NVMe PCI endpoint target driver to the PCI endpoint
controller, BAR 0 is allocated with enough space to accommodate the admin queue
and multiple I/O queues. The maximum number of I/O queue pairs that can be
supported is limited by several factors.

1) The NVMe target core code limits the maximum number of I/O queues to the
   number of online CPUs.
2) The total number of queue pairs, including the admin queue, cannot exceed
   the number of MSI-X or MSI vectors available.
3) The total number of completion queues must not exceed the total number of
   PCI mapping windows minus 2 (see above).

The NVMe endpoint function driver allows configuring the maximum number of
queue pairs through configfs.

Limitations and NVMe Specification Non-Compliance
-------------------------------------------------

Similar to the NVMe target core code, the NVMe PCI endpoint target driver does
not support multiple submission queues using the same completion queue. All
submission queues must specify a unique completion queue.


User Guide
==========

This section describes the hardware requirements and how to set up an NVMe PCI
endpoint target device.

Kernel Requirements
-------------------

The kernel must be compiled with the configuration options CONFIG_PCI_ENDPOINT,
CONFIG_PCI_ENDPOINT_CONFIGFS, and CONFIG_NVME_TARGET_PCI_EPF enabled.
CONFIG_PCI, CONFIG_BLK_DEV_NVME and CONFIG_NVME_TARGET must also be enabled
(obviously).

In addition to this, at least one PCI endpoint controller driver should be
available for the endpoint hardware used.

To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK)
is also recommended. With this, a simple setup using a null_blk block device
as a subsystem namespace can be used.
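
As a quick check, the presence of these options can be verified in the kernel
build configuration. The command below is only an illustration run against the
.config file of a kernel build tree (the configuration file location varies
between distributions); each option should show up set to y or m::

   # grep -E "CONFIG_(PCI_ENDPOINT|PCI_ENDPOINT_CONFIGFS|NVME_TARGET_PCI_EPF|BLK_DEV_NULL_BLK)=" .config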

Hardware Requirements
---------------------

To use the NVMe PCI endpoint target driver, at least one endpoint controller
device is required.

To find the list of endpoint controller devices in the system::

   # ls /sys/class/pci_epc/
   a40000000.pcie-ep

If PCI_ENDPOINT_CONFIGFS is enabled::

   # ls /sys/kernel/config/pci_ep/controllers
   a40000000.pcie-ep

The endpoint board must of course also be connected to a host with a PCI cable
with RX-TX signal swapped. If the host PCI slot used does not have
plug-and-play capabilities, the host should be powered off when the NVMe PCI
endpoint device is configured.

NVMe Endpoint Device
--------------------

Creating an NVMe endpoint device is a two-step process. First, an NVMe target
subsystem and port must be defined. Second, the NVMe PCI endpoint device must
be set up and bound to the subsystem and port created.

Creating a NVMe Subsystem and Port
----------------------------------

Details about how to configure a NVMe target subsystem and port are outside the
scope of this document. The following only provides a simple example of a port
and subsystem with a single namespace backed by a null_blk device.

First, make sure that configfs is enabled::

   # mount -t configfs none /sys/kernel/config

Next, create a null_blk device (default settings give a 250 GB device without
memory backing). The block device created will be /dev/nullb0 by default::

   # modprobe null_blk
   # ls /dev/nullb0
   /dev/nullb0

The NVMe PCI endpoint function target driver must be loaded::

   # modprobe nvmet_pci_epf
   # lsmod | grep nvmet
   nvmet_pci_epf          32768  0
   nvmet                 118784  1 nvmet_pci_epf
   nvme_core              131072  2 nvmet_pci_epf,nvmet

Now, create a subsystem and a port that we will use to create a PCI target
controller when setting up the NVMe PCI endpoint target device. In this
example, the subsystem is created with a maximum of 4 I/O queue pairs::

   # cd /sys/kernel/config/nvmet/subsystems
   # mkdir nvmepf.0.nqn
   # echo -n "Linux-pci-epf" > nvmepf.0.nqn/attr_model
   # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id
   # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id
   # echo 1 > nvmepf.0.nqn/attr_allow_any_host
   # echo 4 > nvmepf.0.nqn/attr_qid_max

Next, create and enable the subsystem namespace using the null_blk block
device::

   # mkdir nvmepf.0.nqn/namespaces/1
   # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path
   # echo 1 > "nvmepf.0.nqn/namespaces/1/enable"

Finally, create the target port and link it to the subsystem::

   # cd /sys/kernel/config/nvmet/ports
   # mkdir 1
   # echo -n "pci" > 1/addr_trtype
   # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \
        /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn

Creating a NVMe PCI Endpoint Device
-----------------------------------

With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint
device can now be created and enabled. The NVMe PCI endpoint target driver
should already be loaded (that is done automatically when the port is created)::

   # ls /sys/kernel/config/pci_ep/functions
   nvmet_pci_epf

Next, create function 0::

   # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
   # mkdir nvmepf.0
   # ls nvmepf.0/
   baseclass_code    msix_interrupts   secondary
   cache_line_size   nvme              subclass_code
   deviceid          primary           subsys_id
   interrupt_pin     progif_code       subsys_vendor_id
   msi_interrupts    revid             vendorid

Configure the function using any device ID (the vendor ID for the device will
be automatically set to the same value as the NVMe target subsystem vendor
ID)::

   # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
   # echo 0xBEEF > nvmepf.0/deviceid
   # echo 32 > nvmepf.0/msix_interrupts

If the PCI endpoint controller used does not support MSI-X, MSI can be
configured instead::

   # echo 32 > nvmepf.0/msi_interrupts

Next, let's bind our endpoint device with the target subsystem and port that we
created::

   # echo 1 > nvmepf.0/nvme/portid
   # echo "nvmepf.0.nqn" > nvmepf.0/nvme/subsysnqn

The endpoint function can then be bound to the endpoint controller and the
controller started::

   # cd /sys/kernel/config/pci_ep
   # ln -s functions/nvmet_pci_epf/nvmepf.0 controllers/a40000000.pcie-ep/
   # echo 1 > controllers/a40000000.pcie-ep/start

On the endpoint machine, kernel messages will show information as the NVMe
target device and endpoint device are created and connected.

.. code-block:: text

   null_blk: disk nullb0 created
   null_blk: module loaded
   nvmet: adding nsid 1 to subsystem nvmepf.0.nqn
   nvmet_pci_epf nvmet_pci_epf.0: PCI endpoint controller supports MSI-X, 32 vectors
   nvmet: Created nvm controller 1 for subsystem nvmepf.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:2ab90791-2246-4fbb-961d-4c3d5a5a0176.
   nvmet_pci_epf nvmet_pci_epf.0: New PCI ctrl "nvmepf.0.nqn", 4 I/O queues, mdts 524288 B

PCI Root-Complex Host
---------------------

Booting the PCI host will result in the initialization of the PCIe link (this
may be signaled by the PCI endpoint driver with a kernel message). A kernel
message on the endpoint will also signal when the host NVMe driver enables the
device controller::

   nvmet_pci_epf nvmet_pci_epf.0: Enabling controller

On the host side, the NVMe PCI endpoint function target device is discoverable
as a PCI device, with the vendor ID and device ID as configured::

   # lspci -n
   0000:01:00.0 0108: 1b96:beef

And this device will be recognized as an NVMe device with a single namespace::

   # lsblk
   NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
   nvme0n1     259:0    0   250G  0 disk

The NVMe endpoint block device can then be used as any other regular NVMe
namespace block device. The *nvme* command line utility can be used to get more
detailed information about the endpoint device::

   # nvme id-ctrl /dev/nvme0
   NVME Identify Controller:
   vid       : 0x1b96
   ssvid     : 0x1b96
   sn        : 94993c85650ef7bcd625
   mn        : Linux-pci-epf
   fr        : 6.13.0-r
   rab       : 6
   ieee      : 000000
   cmic      : 0xb
   mdts      : 7
   cntlid    : 0x1
   ver       : 0x20100
   ...
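
As a quick functional check, the namespace can be read with any regular block
I/O tool. The command below is only an illustration and assumes the
/dev/nvme0n1 device name shown above; it reads from the device and does not
modify its contents::

   # dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1024 iflag=direct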

Endpoint Bindings
=================

The NVMe PCI endpoint target driver uses the PCI endpoint configfs device
attributes as follows.

================ ===========================================================
vendorid         Ignored (the vendor id of the NVMe target subsystem is used)
deviceid         Anything is OK (e.g. PCI_ANY_ID)
revid            Do not care
progif_code      Must be 0x02 (NVM Express)
baseclass_code   Must be 0x01 (PCI_BASE_CLASS_STORAGE)
subclass_code    Must be 0x08 (Non-Volatile Memory controller)
cache_line_size  Do not care
subsys_vendor_id Ignored (the subsystem vendor id of the NVMe target subsystem
                 is used)
subsys_id        Anything is OK (e.g. PCI_ANY_ID)
msi_interrupts   At least equal to the number of queue pairs desired
msix_interrupts  At least equal to the number of queue pairs desired
interrupt_pin    Interrupt PIN to use if MSI and MSI-X are not supported
================ ===========================================================

The NVMe PCI endpoint target function also has some specific configurable
fields defined in the *nvme* subdirectory of the function directory. These
fields are as follows.

================ ===========================================================
mdts_kb          Maximum data transfer size in KiB (default: 512)
portid           The ID of the target port to use
subsysnqn        The NQN of the target subsystem to use
================ ===========================================================
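
For reference, the attributes from both tables can be set together as in the
User Guide example above. The values below (the device ID, vector count, port
ID, NQN and the function directory name nvmepf.0) are example values only, and
mdts_kb is shown simply to illustrate overriding the 512 KiB default::

   # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
   # echo 0xBEEF > nvmepf.0/deviceid
   # echo 32 > nvmepf.0/msix_interrupts
   # echo 1024 > nvmepf.0/nvme/mdts_kb
   # echo 1 > nvmepf.0/nvme/portid
   # echo "nvmepf.0.nqn" > nvmepf.0/nvme/subsysnqn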