Lines Matching +full:network +full:- +full:oriented

5 This write-up is based on three articles published at lwn.net:
7 - <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8 - <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9 - <https://lwn.net/Articles/650786/> A walk among the symlinks
15 - per-directory parallel name lookup.
16 - ``openat2()`` resolution restriction flags.
27 the early parts of the analysis we will divide off symlinks - leaving
30 will allow us to review "REF-walk" and "RCU-walk" separately. But we
35 --------------------------
37 .. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
43 non-"``/``" characters. These form two kinds of paths. Those that
49 .. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
81 A pathname that contains at least one non-<slash> character and
106 ----------------------
135 These are typically filesystems that are shared across a network,
142 REF-walk: simple concurrency management with refcounts and spinlocks
143 --------------------------------------------------------------------
148 pathname, and focus on the "REF-walk" approach to concurrency
151 (indicating the use of RCU-walk) is set.
155 REF-walk is fairly heavy-handed with locks and reference counts. Not
156 as heavy-handed as in the old "big kernel lock" days, but certainly not
162 The locking mechanisms used by REF-walk include:
164 dentry->d_lockref
168 reference count. The special-sauce of this primitive is that the
174 will behave as expected. It also protects the ``->d_inode`` reference
193 So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
196 dentry->d_lock
205 When looking for a name in a directory, REF-walk takes ``d_lock`` on
211 REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
232 The name-lookup process (``d_lookup()``) does *not* try to prevent this
244 ``-EAGAIN``.
246 inode->i_rwsem
279 If two threads attempt to look up the same name at the same time - a
280 name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
284 per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
302 dentry from the secondary hash table - it will normally have been
321 mnt->mnt_count
324 ``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325 Per-CPU here means that incrementing the count is cheap as it only
326 uses CPU-local memory, but checking if the count is zero is expensive as
331 in particular, doesn't stabilize the link to the mounted-on dentry. It
334 filesystem. So a reference through ``->mnt_count`` provides a stable
335 reference to the mounted dentry, but not the mounted-on dentry.
356 needed to stabilize the link to the mounted-on dentry, which the
364 ``-EAGAIN``.
377 ----------------------------------------------
379 .. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
382 in a ``struct nameidata``, "namei" being the traditional name - dating
383 all the way back to `First Edition Unix`_ - of the function that
415 only assigned the first time it is used, or when a non-standard root
463 This "hand-over-hand" sequencing of getting a reference to the new
466 analogue in the "RCU-walk" version.
469 ----------------------------
471 ``link_path_walk()`` only walks as far as setting ``nd->last`` and
472 ``nd->last_type`` to refer to the final component of the path. It does
479 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
487 ``path_lookupat()`` is nearly as simple - it is used when an existing
515 ---------------------------
517 Apart from symbolic links, there are only two parts of the "REF-walk"
522 ``->d_revalidate()`` dentry method to ensure that the cached information
540 to three different flags that might be set in ``dentry->d_flags``:
559 ``d_manage()`` by returning ``-EISDIR``.
569 If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
592 This will become more important next time when we examine RCU-walk
595 RCU-walk - faster pathname lookup in Linux
598 RCU-walk is another algorithm for performing pathname lookup in Linux.
599 It is in many ways similar to REF-walk and the two share quite a bit
600 of code. The significant difference in RCU-walk is how it allows for
603 We noted that REF-walk is complex because there are numerous details
604 and special cases. RCU-walk reduces this complexity by simply
605 refusing to handle a number of cases -- it instead falls back to
606 REF-walk. The difficulty with RCU-walk comes from a different
612 --------------------------
625 The REF-walk mechanism already described certainly doesn't follow this
627 be other threads modifying the data. RCU-walk, in contrast, is
631 other parts it is important that RCU-walk can quickly fall back to
632 using REF-walk.
634 Pathname lookup always starts in RCU-walk mode but only remains there
640 REF-walk.
643 ``vfsmount`` and ``dentry``, and ensuring that these are still valid -
644 that a path walk with REF-walk would have found the same entries.
645 This is an invariant that RCU-walk must guarantee. It can only make
647 REF-walk could also have made if it were walking down the tree at the
649 processed with the reliable, if slightly sluggish, REF-walk. If
650 RCU-walk finds it cannot stop gracefully, it simply gives up and
651 restarts from the top with REF-walk.
653 This pattern of "try RCU-walk, if that fails try REF-walk" can be
660 They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
662 special flag to request "REF-walk". If either of those report the
665 revalidated - normally entries are only revalidated if the filesystem
669 REF-walk, but will never then try to switch back to RCU-walk. Places
670 that trip up RCU-walk are much more likely to be near the leaves and
675 --------------------------------
677 RCU is, unsurprisingly, critical to RCU-walk mode. The
678 ``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
680 data structures - dentries, inodes, super_blocks, and mounts - will
687 As we saw above, REF-walk holds a counted reference to the current
691 taken to prevent certain changes from happening. RCU-walk must not
696 To preserve the invariant mentioned above (that RCU-walk may only make
697 decisions that REF-walk could have made), it must make the checks at
698 or near the same places that REF-walk holds the references. So, when
699 REF-walk increments a reference count or takes a spinlock, RCU-walk
701 similar function. When REF-walk decrements the count or drops the
702 lock, RCU-walk checks if the sampled status is still valid using
706 RCU-walk accesses two different fields in a seqlock-protected
709 is needed - which it usually is - RCU-walk must take a copy and then
713 imposes a memory barrier so that no memory-read instruction from
717 byte-wise name equality, calls into the filesystem to compare a name
720 are consistent, and only then is ``->d_compare()`` called. When
728 the bigger picture of how RCU-walk uses seqlocks.
730 ``mount_lock`` and ``nd->m_seq``
733 We already met the ``mount_lock`` seqlock when REF-walk used it to
734 ensure that crossing a mount point is performed safely. RCU-walk uses
738 descends the tree, RCU-walk samples the state of ``mount_lock`` at the
743 relatively rare, it is reasonable to fall back on REF-walk any time
746 ``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
747 sequence, whether switching to REF-walk for the rest of the path or
751 whole RCU-walk sequence is aborted and the path is processed again by
752 REF-walk.
754 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
755 that, had REF-walk taken counted references on each vfsmount, the
759 ``dentry->d_seq`` and ``nd->seq``
762 In place of taking a count or lock on ``d_reflock``, RCU-walk samples
763 the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
764 ``seq`` field of the nameidata structure, so ``nd->seq`` should always be
765 the current sequence number of ``nd->dentry``. This number needs to be
775 ``mnt->mnt_mountpoint`` link to get a new dentry and collect its
778 mount point and follow the ``mnt->mnt_root`` link. This would imply a
783 The inode pointer, stored in ``->d_inode``, is a little more
791 ``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
804 the old dentry which we saw in REF-walk.
806 No ``inode->i_rwsem`` or even ``rename_lock``
811 ``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
812 take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
813 to notice, the result will be either that RCU-walk fails to find the
816 REF-walk mode which can take whatever locks are needed.
818 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
819 any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
822 something that actually is there. When RCU-walk fails to find
824 already drops down to REF-walk and tries again with appropriate
829 -----------------------------------------
831 That "dropping down to REF-walk" typically involves a call to
832 ``unlazy_walk()``, so named because "RCU-walk" is also sometimes
844 Other reasons for dropping out of RCU-walk that do not trigger a call
848 will return ``-ECHILD`` which will percolate up until it triggers a new
849 attempt from the top using REF-walk.
855 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
862 all. For ``dentry->d_lockref``, it is safe to increment the reference
864 "dead" which involves setting the counter to ``-128``.
867 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 the standard way of calling ``mnt_put()`` - an unmount may have
878 --------------------------
880 RCU-walk depends almost entirely on cached information and often will
882 besides the already-mentioned component-name comparison, where the
883 file system might be included in RCU-walk, and it must know to be
886 If the filesystem has non-standard permission-checking requirements -
888 - the ``i_op->permission`` interface might be called during RCU-walk.
890 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
891 promptly. ``i_op->permission`` is given the inode pointer, not the
901 ``d_op->d_revalidate`` may be called in RCU-walk too. This interface
910 ------------------
912 In various places in the details of REF-walk and RCU-walk, and also in
917 can see that in the high-level approach of first trying RCU-walk and
918 then trying REF-walk, and in places where ``unlazy_walk()`` is used to
919 switch to REF-walk for the rest of the path. We also saw it earlier
924 again - repeatedly". This is seen with the use of ``rename_lock`` and
925 ``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
944 Then a consideration of access-time updates and summary of the various
948 -----------------
961 "``readlink -f``" command does, though it also edits out "``.``" and
978 successfully - the error ``ELOOP`` must be returned. Loops can be
987 true loops, but also to "very deep" non-loops. It's not about memory
1012 ---------------------------------------
1016 to external storage. It is particularly important for RCU-walk to be
1018 it doesn't need to drop down into REF-walk.
1020 .. _object-oriented design pattern: https://lwn.net/Articles/446317/
1026 common `object-oriented design pattern`_ in the kernel). This will
1053 Taking a reference to a cache page is often possible even in RCU-walk
1056 out of RCU-walk mode completely. Even filesystems that allocate
1058 allocate memory without the need to drop out of RCU-walk. If a
1059 filesystem cannot successfully get a reference in RCU-walk mode, it
1060 must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1061 REF-walk mode in which the filesystem is allowed to sleep.
1063 The place for all this to happen is the ``i_op->get_link()`` inode
1064 method. This is called both in RCU-walk and REF-walk. In RCU-walk the
1065 ``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
1066 RCU-walk. Much like the ``i_op->permission()`` method we
1067 looked at previously, ``->get_link()`` would need to be careful that
1070 ``struct delayed_called`` will be passed to ``->get_link()``:
1076 whether in RCU-walk or REF-walk, the symlink stack needs to contain,
1079 - the ``struct path`` to provide a reference to the previous path
1080 - the ``const char *`` to provide a reference to the to previous name
1081 - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1082 - the ``struct delayed_call`` for later invocation.
1086 remnant). On a 64-bit system, this is about 40 bytes per entry;
1096 ---------------------
1099 components in the path and all of the non-final symlinks. As symlinks
1116 want to pop the symlink-just-completed off the stack before pushing
1117 the symlink-just-found to avoid leaving empty path remnants that would
1137 A pair of special-case symlinks deserve a little further explanation.
1149 aren't really (and are therefore commonly referred to as "magic-links")::
1151 $ ls -l /proc/self/fd/1
1152 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1157 objects you get a name that might refer to the same file - unless it
1159 one of these, the ``->get_link()`` method in "procfs" doesn't return
1161 ``nameidata`` in place to point to that target. ``->get_link()`` then
1166 --------------------------------------------
1203 will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
1208 2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
1209 wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned
1210 otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
1216 ln -s bar /tmp/foo
1221 like for a non-creating open: lookup_last() or open_last_lookup()
1226 ------------------------
1228 We previously said of RCU-walk that it would "take no locks, increment
1251 care about access-time updates during pathname lookup.
1260 quite complex. Trying to stay in RCU-walk while doing it is best
1266 ``relatime``, many filesystems record ``atime`` with a one-second
1269 It is easy to test if an ``atime`` update is needed while in RCU-walk
1270 mode and, if it isn't, the update can be skipped and RCU-walk mode
1272 path walk drop down to REF-walk. All of this is handled in the
1276 -----------
1294 to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1310 to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1311 ``ND_JUMPED`` is set when the look completes - which may be at the
1314 Resolution-restriction flags
1322 ``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1326 ``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1328 ``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1331 bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1332 lookup is determined by the first mountpoint the path lookup reaches --
1334 with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1341 resolution of "..". Magic-links are also blocked.
1345 the starting point, and ".." at the starting point will act as a no-op. As with
1347 attacks against ".." resolution. Magic-links are also blocked.
1349 Final-component flags
1361 "``mount --bind``".
1375 available to the filesystem and particularly the ``->d_revalidate()``
1378 These flags were previously useful for ``->lookup()`` too but with the
1379 introduction of ``->atomic_open()`` they are less relevant there.
1382 ---------------
1385 in good shape - various parts are certainly easier to understand now
1387 "finished". As already mentioned, RCU-walk currently only follows