===========
NFS LOCALIO
===========

Overview
========

The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
server to reliably handshake to determine if they are on the same
host. Select "NFS client and server support for LOCALIO auxiliary
protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).

Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Because these operations skip XDR and RPC, they complete faster.

The LOCALIO auxiliary protocol's implementation, which uses the same
connection as NFS traffic, follows the pattern established by the NFS
ACL protocol extension.

The LOCALIO auxiliary protocol is needed to allow robust discovery of
clients local to their servers. In a private implementation that
preceded use of this LOCALIO protocol, a fragile sockaddr network
address based match against all local network interfaces was attempted.
But unlike the LOCALIO protocol, the sockaddr-based matching didn't
handle use of iptables or containers.

The robust handshake between local client and server is just the
beginning. The ultimate use case this locality makes possible is for
the client to open files and issue reads, writes and commits directly
to the server without having to go over the network. The requirement
is to perform these loopback NFS operations as efficiently as
possible. This is particularly useful for container use cases
(e.g. Kubernetes) where it is possible to run an IO job local to the
server.

The performance advantage realized from LOCALIO's ability to bypass
using XDR and RPC for reads, writes and commits can be extreme, e.g.:

fio for 20 secs with directio, qd of 8, 16 libaio threads:
  - With LOCALIO:
    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s  (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

  - Without LOCALIO:
    4K read:    IOPS=79.2k, BW=309MiB/s  (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s  (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)

fio for 20 secs with directio, qd of 8, 1 libaio thread:
  - With LOCALIO:
    4K read:    IOPS=230k,  BW=898MiB/s  (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

  - Without LOCALIO:
    4K read:    IOPS=77.1k, BW=301MiB/s  (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s  (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
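
To put the figures above in perspective, the IOPS ratios can be computed
directly. The sketch below only reuses the IOPS values quoted above (in
thousands); the dictionary layout and the helper name ``speedup`` are
illustrative choices and are not part of fio or LOCALIO.

```python
# IOPS in thousands, copied from the fio results above.
# Each entry is (with LOCALIO, without LOCALIO).
RESULTS = {
    "16 threads": {
        "4K read": (979.0, 79.2),
        "4K write": (165.0, 59.8),
        "128K read": (402.0, 33.9),
        "128K write": (11.5, 11.5),
    },
    "1 thread": {
        "4K read": (230.0, 77.1),
        "4K write": (22.6, 32.8),
        "128K read": (38.8, 24.4),
        "128K write": (11.4, 11.4),
    },
}


def speedup(with_localio, without_localio):
    """Return the IOPS ratio of LOCALIO to non-LOCALIO."""
    return with_localio / without_localio


if __name__ == "__main__":
    for config, workloads in RESULTS.items():
        for name, (w, wo) in workloads.items():
            print(f"{config:10s} {name:10s} {speedup(w, wo):5.2f}x")
```

With these numbers the 16-thread 4K read case comes out at roughly 12x,
the 128K write cases at 1x, and the single-thread 4K write case below 1x
(22.6k versus 32.8k IOPS).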

FAQ
===

1. What are the use cases for LOCALIO?

   a. Workloads where the NFS client and server are on the same host
      realize improved IO performance. In particular, it is common when
      running containerised workloads for jobs to find themselves
      running on the same host as the knfsd server being used for
      storage.

2. What are the requirements for LOCALIO?

   a. Bypass use of the network RPC protocol as much as possible. This
      includes bypassing XDR and RPC for open, read, write and commit
      operations.
   b. Allow client and server to autonomously discover if they are
      running local to each other without making any assumptions about
      the local network topology.
   c. Support the use of containers by being compatible with relevant
      namespaces (e.g. network, user, mount).
   d. Support all versions of NFS. NFSv3 is of particular importance
      because it has wide enterprise usage and pNFS flexfiles makes use
      of it for the data path.

3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
   deciding if the NFS client and server are co-located on the same
   host?

   Since one of the main use cases is containerised workloads, we cannot
   assume that IP addresses will be shared between the client and
   server. This sets up a requirement for a handshake protocol that
   needs to go over the same connection as the NFS traffic in order to
   identify that the client and the server really are running on the
   same host. The handshake uses a secret that is sent over the wire,
   and can be verified by both parties by comparing with a value stored
   in shared kernel memory if they are truly co-located.

4. Does LOCALIO improve pNFS flexfiles?

   Yes, LOCALIO complements pNFS flexfiles by allowing it to take
   advantage of NFS client and server locality.  Policy that initiates
   client IO as closely to the server where the data is stored naturally
   benefits from the data path optimization LOCALIO provides.

5. Why not develop a new pNFS layout to enable LOCALIO?

   A new pNFS layout could be developed, but doing so would put the
   onus on the server to somehow discover that the client is co-located
   when deciding to hand out the layout.
   There is value in a simpler approach (as provided by LOCALIO) that
   allows the NFS client to negotiate and leverage locality without
   requiring more elaborate modeling and discovery of such locality in a
   more centralized manner.

6. Why is having the client perform a server-side file OPEN, without
   using RPC, beneficial?  Is the benefit pNFS specific?

   Avoiding the use of XDR and RPC for file opens is beneficial to
   performance regardless of whether pNFS is used. Especially when
   dealing with small files it's best to avoid going over the wire
   whenever possible, otherwise it could reduce or even negate the
   benefits of avoiding the wire for doing the small file I/O itself.
   Given LOCALIO's requirements the current approach of having the
   client perform a server-side file open, without using RPC, is ideal.
   If in the future requirements change then we can adapt accordingly.

7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?

   Strong authentication is usually tied to the connection itself. It
   works by establishing a context that is cached by the server, and
   that acts as the key for discovering the authorisation token, which
   can then be passed to rpc.mountd to complete the authentication
   process. On the other hand, in the case of AUTH_UNIX, the credential
   that was passed over the wire is used directly as the key in the
   upcall to rpc.mountd. This simplifies the authentication process, and
   so makes AUTH_UNIX easier to support.

8. How do export options that translate RPC user IDs behave for LOCALIO
   operations (e.g. root_squash, all_squash)?

   Export options that translate user IDs are managed by nfsd_setuser()
   which is called by nfsd_setuser_and_check_port() which is called by
   __fh_verify().  So they get handled exactly the same way for LOCALIO
   as they do for non-LOCALIO.

9. How does LOCALIO make certain that object lifetimes are managed
   properly given NFSD and NFS operate in different contexts?

   See the detailed "NFS Client and Server Interlock" section below.
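
As an illustration of FAQ item 8, the following userspace sketch models
what ID-translating export options do to an incoming credential. It is
not the kernel's nfsd_setuser(): the function name ``squash_ids``, its
signature and the default anonymous ID of 65534 (the exports(5) default
for anonuid/anongid) are assumptions made for the example. The point is
only that the same mapping applies whether a request arrived via RPC or
via LOCALIO.

```python
ANON_ID = 65534  # default anonuid/anongid documented in exports(5)


def squash_ids(uid, gid, root_squash=True, all_squash=False,
               anonuid=ANON_ID, anongid=ANON_ID):
    """Model export-option ID translation (hypothetical helper)."""
    if all_squash:
        # all_squash maps every uid/gid to the anonymous IDs.
        return anonuid, anongid
    if root_squash:
        # root_squash maps only uid 0 and gid 0 to the anonymous IDs.
        if uid == 0:
            uid = anonuid
        if gid == 0:
            gid = anongid
    return uid, gid
```

For example, ``squash_ids(0, 0)`` yields the anonymous IDs, while
``squash_ids(0, 0, root_squash=False)`` leaves root untouched, which
mirrors the behaviour of no_root_squash.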

RPC
===

The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. This protocol isn't part of an IETF
standard, nor does it need to be considering it is a Linux-to-Linux
auxiliary RPC protocol that amounts to an implementation detail.

The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.
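
The difference between the two XDR encodings can be sketched in
userspace. Under the XDR rules (RFC 4506), a fixed-length opaque is just
the bytes padded to a multiple of four, while a variable-length opaque
is preceded by a 4-byte length word. The helper names below are
hypothetical and this is not the kernel's XDR code.

```python
import struct
import uuid

UUID_SIZE = 16


def xdr_pad(n):
    """Number of zero bytes needed to reach a 4-byte boundary."""
    return (4 - n % 4) % 4


def encode_opaque_fixed(data, size):
    """XDR fixed-length opaque: the bytes plus padding, no length word."""
    if len(data) != size:
        raise ValueError("fixed-length opaque must be exactly %d bytes" % size)
    return data + b"\x00" * xdr_pad(size)


def encode_opaque_variable(data):
    """XDR variable-length opaque: 4-byte big-endian length, bytes, padding."""
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad(len(data))


def decode_opaque_fixed(buf, size):
    """Inverse of encode_opaque_fixed for a buffer that starts at the opaque."""
    return buf[:size]


if __name__ == "__main__":
    nonce = uuid.uuid4().bytes
    print(len(encode_opaque_fixed(nonce, UUID_SIZE)))   # fixed encoding size
    print(len(encode_opaque_variable(nonce)))           # variable encoding size
```

For a 16-byte UUID the fixed form needs no padding and no length word,
so the receiver can decode it without first reading and validating a
length.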

The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ )::

    Linux Kernel Organization       400122  nfslocalio

The LOCALIO protocol spec in rpcgen syntax is::

  /* raw RFC 9562 UUID */
  #define UUID_SIZE 16
  typedef u8 uuid_t<UUID_SIZE>;

  program NFS_LOCALIO_PROGRAM {
      version LOCALIO_V1 {
          void
              NULL(void) = 0;

          void
              UUID_IS_LOCAL(uuid_t) = 1;
      } = 1;
  } = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such,
LOCALIO is not registered with rpcbind.
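
For orientation, the following sketch assembles the bytes of a
UUID_IS_LOCAL call according to the ONC RPC version 2 call layout and
the AUTH_SYS credential layout from RFC 5531. It is a userspace
illustration only: the helper names, the example credential values and
the omission of TCP record marking are assumptions, and the kernel
builds its calls through SUNRPC rather than by hand.

```python
import struct

NFS_LOCALIO_PROGRAM = 400122
LOCALIO_V1 = 1
PROC_UUID_IS_LOCAL = 1

RPC_CALL = 0      # msg_type CALL
RPC_VERSION = 2   # ONC RPC protocol version
AUTH_NONE = 0
AUTH_SYS = 1      # also known as AUTH_UNIX


def xdr_opaque_var(data):
    """Variable-length XDR opaque: length word, bytes, zero padding."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def auth_sys_cred(stamp, machine, uid, gid, gids=()):
    """Encode an AUTH_SYS credential (flavor plus authsys_parms body)."""
    body = struct.pack(">I", stamp)
    body += xdr_opaque_var(machine.encode("ascii"))
    body += struct.pack(">II", uid, gid)
    body += struct.pack(">I", len(gids))
    body += b"".join(struct.pack(">I", g) for g in gids)
    return struct.pack(">I", AUTH_SYS) + xdr_opaque_var(body)


def uuid_is_local_call(xid, cred, nonce):
    """Build the call message; TCP record marking is deliberately left out."""
    if len(nonce) != 16:
        raise ValueError("nonce must be a 16-byte UUID")
    header = struct.pack(">6I", xid, RPC_CALL, RPC_VERSION,
                         NFS_LOCALIO_PROGRAM, LOCALIO_V1, PROC_UUID_IS_LOCAL)
    verf = struct.pack(">II", AUTH_NONE, 0)  # empty verifier
    return header + cred + verf + nonce      # nonce as fixed-size opaque
```

Because the call travels on the connection already used for NFS, no
rpcbind lookup of program 400122 is involved.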

NFS Common and Client/Server Handshake
======================================

fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
to generate a nonce (single-use UUID) and an associated short-lived
nfs_uuid_t struct, and to register it with nfs_common for subsequent
lookup and verification by the NFS server. If the server finds a match,
it populates members in the nfs_uuid_t struct. The NFS client then uses
nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
the nn->nfsd_serv clients_list.  See:
fs/nfs/localio.c:nfs_local_probe()
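
The handshake can be modelled in userspace as a lookup in a table that
only co-located parties share. In the sketch below, one ``NfsCommon``
instance stands in for the kernel memory of one host; all class and
function names are hypothetical and the real interfaces in
fs/nfs_common/nfslocalio.c differ.

```python
import uuid


class NfsCommon:
    """Stand-in for the nonce list that nfs_common keeps on one host."""

    def __init__(self):
        self._uuids = {}

    def register(self, nonce):
        self._uuids[nonce] = {"net": None}
        return self._uuids[nonce]

    def unregister(self, nonce):
        self._uuids.pop(nonce, None)

    def lookup(self, nonce):
        return self._uuids.get(nonce)


def server_uuid_is_local(table, nonce, server_net):
    """Server side: succeed only if the nonce is visible on this host."""
    entry = table.lookup(nonce)
    if entry is None:
        return False
    entry["net"] = server_net  # server populates members for the client
    return True


def client_probe(client_table, call_server):
    """Client side: register a fresh nonce, ask the server, then drop it."""
    nonce = uuid.uuid4().bytes
    entry = client_table.register(nonce)
    try:
        is_local = call_server(nonce)
    finally:
        client_table.unregister(nonce)  # the nonce is single-use
    return entry["net"] if is_local else None
```

A server on another host looks the nonce up in its own table, finds
nothing, and the probe reports that the pair is not local.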

nfs_common's nfs_uuids list is the basis for LOCALIO enablement. As
such, it has members that point to nfsd memory for direct use by the
client (e.g. 'net' is the server's network namespace; through it the
client can access nn->nfsd_serv with proper RCU read access). It is
this client and server synchronization that enables advanced usage and
lifetime of objects to span from the host kernel's nfsd to per-container
knfsd instances that are connected to NFS clients running on the same
local host.

NFS Client and Server Interlock
===============================

LOCALIO provides the nfs_uuid_t object and associated interfaces to
allow proper network namespace (net-ns) and NFSD object refcounting.

LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref
to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), ensuring
that a net-ns is not destroyed while in use by nfsd_open_local_fh().
This warrants a more detailed explanation:

    nfsd_open_local_fh() uses nfsd_net_try_get() before opening its
    nfsd_file handle and then the caller (NFS client) must drop the
    reference for the nfsd_file and associated net-ns using
    nfsd_file_put_local() once it has completed its IO.

    This interlock relies heavily on nfsd_open_local_fh() being
    afforded the ability to safely deal with the possibility that the
    NFSD's net-ns (and nfsd_net by association) may have been destroyed
    by nfsd_destroy_serv() via nfsd_shutdown_net().

This interlock of the NFS client and server has been verified to fix an
easy-to-hit crash that would occur if an NFSD instance running in a
container, with a LOCALIO client mounted, is shut down. Upon restart of
the container and associated NFSD, the client would go on to crash due
to a NULL pointer dereference caused by the LOCALIO client attempting
to call nfsd_open_local_fh() without having a proper reference on
NFSD's net-ns.
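
The try-get / put pattern described above can be illustrated with a toy
reference object. This is a userspace model, not the kernel's percpu
reference: the class and function names are hypothetical, and a real
shutdown would wait for outstanding references to drain, whereas the
sketch merely reports how many remain.

```python
import threading


class NetRef:
    """Toy stand-in for a reference that can be shut down."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._live = True

    def try_get(self):
        """Take a reference unless shutdown has already started."""
        with self._lock:
            if not self._live:
                return False
            self._count += 1
            return True

    def put(self):
        with self._lock:
            assert self._count > 0, "put without matching get"
            self._count -= 1

    def shutdown(self):
        """Refuse new references and return the number still held."""
        with self._lock:
            self._live = False
            return self._count


def open_local_fh(ref):
    """Return a fake file handle, or None if the server is going away."""
    if not ref.try_get():
        return None
    return {"ref": ref}


def file_put_local(handle):
    """Drop the reference taken by open_local_fh() once IO is complete."""
    handle["ref"].put()
```

The important property is that an open attempted after shutdown fails
cleanly instead of touching state that may already be gone.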

NFS Client issues IO instead of Server
======================================

Because LOCALIO is focused on protocol bypass to achieve improved IO
performance, alternatives to the traditional NFS wire protocol (SUNRPC
with XDR) must be provided to access the backing filesystem.

See fs/nfs/localio.c:nfs_local_open_fh() and
fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
focused use of select NFS server objects to allow a client local to a
server to open a file pointer without needing to go over the network.

The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
both the associated nfsd network namespace and nn->nfsd_serv in terms of
RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
to nfs_local_open_fh() and the client will try to reestablish the
LOCALIO resources needed by calling nfs_local_probe() again. This
recovery is needed if an nfsd instance running in a container reboots
while a LOCALIO client is connected to it.
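
The recovery path can be sketched as a small retry loop. The names
below, the single retry, and the choice to return ``None`` so that the
caller can fall back to regular NFS over RPC are assumptions made for
illustration; the actual control flow in fs/nfs/localio.c may differ.

```python
import errno


def open_with_recovery(open_local_fh, probe, max_probes=1):
    """Try a local open; on ENXIO re-probe and retry, else signal fallback."""
    probes = 0
    while True:
        try:
            return open_local_fh()
        except OSError as err:
            if err.errno != errno.ENXIO:
                raise
        # The server-side objects went away: try to re-establish LOCALIO.
        if probes >= max_probes or not probe():
            return None  # assumed behaviour: caller uses the RPC path
        probes += 1


class FakeServer:
    """Test double: local opens fail with ENXIO until a probe succeeds."""

    def __init__(self):
        self.ready = False

    def open_local_fh(self):
        if not self.ready:
            raise OSError(errno.ENXIO, "nfsd objects no longer valid")
        return "nfsd_file"

    def probe(self):
        self.ready = True
        return True
```

Bounding the number of probes keeps a client from spinning when the
server stays down.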

Once the client has an open nfsd_file pointer it will issue reads,
writes and commits directly to the underlying local filesystem (normally
done by the NFS server). As such, for these operations, the NFS client
is issuing IO to the underlying local filesystem that it is sharing with
the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
fs/nfs/localio.c:nfs_local_commit().

With normal NFS that makes use of RPC to issue IO to the server, if an
application uses O_DIRECT the NFS client will bypass the pagecache but
the NFS server will not. The NFS server's use of buffered IO allows
applications to be less precise with their alignment when issuing IO to
the NFS client. But if all applications properly align their IO, LOCALIO
can be configured to use end-to-end O_DIRECT semantics from the NFS
client to the underlying local filesystem, that it is sharing with
the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module
parameter to Y, e.g.:

    echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics

Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics
(but again, this may cause IO to fail if applications do not properly
align their IO).
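
An application can check its IO against the alignment requirement
before enabling this mode. The helper below is a hypothetical userspace
check of file offset and length only; the 512-byte default is an
assumption (the real logical block size depends on the underlying
device), and O_DIRECT typically also constrains the alignment of the
memory buffer, which this sketch does not cover.

```python
def is_dio_aligned(offset, length, logical_block_size=512):
    """True if both offset and length are multiples of the block size."""
    if logical_block_size <= 0 or logical_block_size & (logical_block_size - 1):
        raise ValueError("logical block size must be a power of two")
    mask = logical_block_size - 1
    return (offset & mask) == 0 and (length & mask) == 0
```

For instance, a 4096-byte read at offset 512 passes with 512-byte
logical blocks but fails with 4096-byte logical blocks.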

Security
========

LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether LOCALIO or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for LOCALIO.

Relative to containers, LOCALIO gives the client access to the network
namespace the server has. This is required to allow the client to access
the server's per-namespace nfsd_net struct. With traditional NFS, the
client is afforded this same level of access (albeit in terms of the NFS
protocol via SUNRPC). No other namespaces (user, mount, etc) have been
altered or purposely extended from the server to the client.

Module Parameters
=================

/sys/module/nfs/parameters/localio_enabled (bool)
controls if LOCALIO is enabled, defaults to Y. If client and server are
local but 'localio_enabled' is set to N then LOCALIO will not be used.

/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
controls if O_DIRECT extends down to the underlying filesystem, defaults
to N. Application IO must be logical blocksize aligned, otherwise
O_DIRECT will fail.

/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
controls if NFSv3 read and write IOs will trigger (re)enabling of
LOCALIO every N (nfs3_localio_probe_throttle) IOs, defaults to 0
(disabled). Must be a power of 2; the admin is responsible for the
consequences of misconfiguring it (too low a value or a
non-power-of-2).
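
A power-of-2 requirement of this kind usually means the "every N IOs"
test is done with a bit mask rather than a division. The sketch below
shows that technique in userspace; that the kernel uses exactly this
test is an assumption, the function names are hypothetical, and raising
an error on a bad value is a choice made for the example.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... and False for 0 or any other value."""
    return n > 0 and (n & (n - 1)) == 0


def should_probe(io_count, throttle):
    """True on every `throttle`-th IO; a throttle of 0 disables probing."""
    if throttle == 0:
        return False
    if not is_power_of_two(throttle):
        raise ValueError("throttle must be a power of two")
    # With a power-of-two throttle, masking the low bits is equivalent
    # to io_count % throttle == 0.
    return (io_count & (throttle - 1)) == 0
```

With a throttle of 4, IOs numbered 0, 4, 8 and 12 out of the first 16
would trigger a probe. A very low value would probe on nearly every IO,
which is presumably why the text warns against it.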

Testing
=======

The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
and commit access have proven stable against various test scenarios:

- Client and server both on the same host.

- All permutations of client and server support enablement for both
  local and remote client and server.

- Testing against NFS storage products that don't support the LOCALIO
  protocol was also performed.

- Client on host, server within a container (for both v3 and v4.2).
  The container testing was in terms of podman managed containers and
  included a successful container stop/restart scenario.

- Formalizing these test scenarios in terms of existing test
  infrastructure is ongoing. Initial regular coverage is provided in
  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
  mount configuration, and includes lockdep and KASAN coverage, see:
  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
  https://github.com/koverstreet/ktest

- Various kdevops testing (in terms of "Chuck's BuildBot") has been
  performed to regularly verify the LOCALIO changes haven't caused any
  regressions to non-LOCALIO NFS use cases.

- All of Hammerspace's various sanity tests pass with LOCALIO enabled
  (this includes numerous pNFS and flexfiles tests).
