Lines Matching +full:link +full:- +full:trigger +full:- +full:order +full:- +full:start

5 This write-up is based on three articles published at lwn.net:
7 - <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8 - <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9 - <https://lwn.net/Articles/650786/> A walk among the symlinks
15 - per-directory parallel name lookup.
16 - ``openat2()`` resolution restriction flags.
27 the early parts of the analysis we will divide off symlinks - leaving
30 will allow us to review "REF-walk" and "RCU-walk" separately. But we
35 --------------------------
37 .. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
43 non-"``/``" characters. These form two kinds of paths. Those that
44 start with slashes are "absolute" and start from the filesystem root.
45 The others are "relative" and start from the current directory, or
49 .. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
81 A pathname that contains at least one non-<slash> character and
106 ----------------------
142 REF-walk: simple concurrency management with refcounts and spinlocks
143 --------------------------------------------------------------------
145 With all of those divisions carefully classified, we can now start
147 we will start with the handling of the "everything else" part of a
148 pathname, and focus on the "REF-walk" approach to concurrency
151 (indicating the use of RCU-walk) is set.
155 REF-walk is fairly heavy-handed with locks and reference counts. Not
156 as heavy-handed as in the old "big kernel lock" days, but certainly not
162 The locking mechanisms used by REF-walk include:
164 dentry->d_lockref
168 reference count. The special-sauce of this primitive is that the
174 will behave as expected. It also protects the ``->d_inode`` reference
193 So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
196 dentry->d_lock
205 When looking for a name in a directory, REF-walk takes ``d_lock`` on
211 REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
232 The name-lookup process (``d_lookup()``) does *not* try to prevent this
244 ``-EAGAIN``.
246 inode->i_rwsem
279 If two threads attempt to look up the same name at the same time - a
280 name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
284 per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
302 dentry from the secondary hash table - it will normally have been
318 look ups from the start and will normally return something from the
321 mnt->mnt_count
324 ``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325 Per-CPU here means that incrementing the count is cheap as it only
326 uses CPU-local memory, but checking if the count is zero is expensive as
331 in particular, doesn't stabilize the link to the mounted-on dentry. It
334 filesystem. So a reference through ``->mnt_count`` provides a stable
335 reference to the mounted dentry, but not the mounted-on dentry.
352 When walking up the tree (towards the root) by following a ".." link,
356 needed to stabilize the link to the mounted-on dentry, which the
364 ``-EAGAIN``.
377 ----------------------------------------------
379 .. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
382 in a ``struct nameidata``, "namei" being the traditional name - dating
383 all the way back to `First Edition Unix`_ - of the function that
392 record the current status of the walk. They start out referring to the
415 only assigned the first time it is used, or when a non-standard root
425 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
459 a symbolic link, step_into() calls pick_link() to deal with it,
463 This "hand-over-hand" sequencing of getting a reference to the new
466 analogue in the "RCU-walk" version.
469 ----------------------------
471 ``link_path_walk()`` only walks as far as setting ``nd->last`` and
472 ``nd->last_type`` to refer to the final component of the path. It does
479 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
487 ``path_lookupat()`` is nearly as simple - it is used when an existing
515 ---------------------------
517 Apart from symbolic links, there are only two parts of the "REF-walk"
522 ``->d_revalidate()`` dentry method to ensure that the cached information
532 lookup a name can trigger changes to how that lookup should be
535 tree, but a few notes specifically related to path lookup are in order
540 to three different flags that might be set in ``dentry->d_flags``:
552 trigger a new automount.
559 ``d_manage()`` by returning ``-EISDIR``.
569 If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
592 This will become more important next time when we examine RCU-walk
595 RCU-walk - faster pathname lookup in Linux
598 RCU-walk is another algorithm for performing pathname lookup in Linux.
599 It is in many ways similar to REF-walk and the two share quite a bit
600 of code. The significant difference in RCU-walk is how it allows for
603 We noted that REF-walk is complex because there are numerous details
604 and special cases. RCU-walk reduces this complexity by simply
605 refusing to handle a number of cases -- it instead falls back to
606 REF-walk. The difficulty with RCU-walk comes from a different
612 --------------------------
625 The REF-walk mechanism already described certainly doesn't follow this
627 be other threads modifying the data. RCU-walk, in contrast, is
631 other parts it is important that RCU-walk can quickly fall back to
632 using REF-walk.
634 Pathname lookup always starts in RCU-walk mode but only remains there
640 REF-walk.
643 ``vfsmount`` and ``dentry``, and ensuring that these are still valid -
644 that a path walk with REF-walk would have found the same entries.
645 This is an invariant that RCU-walk must guarantee. It can only make
647 REF-walk could also have made if it were walking down the tree at the
649 processed with the reliable, if slightly sluggish, REF-walk. If
650 RCU-walk finds it cannot stop gracefully, it simply gives up and
651 restarts from the top with REF-walk.
653 This pattern of "try RCU-walk, if that fails try REF-walk" can be
660 They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
662 special flag to request "REF-walk". If either of those report the
665 revalidated - normally entries are only revalidated if the filesystem
669 REF-walk, but will never then try to switch back to RCU-walk. Places
670 that trip up RCU-walk are much more likely to be near the leaves and
675 --------------------------------
677 RCU is, unsurprisingly, critical to RCU-walk mode. The
678 ``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
680 data structures - dentries, inodes, super_blocks, and mounts - will
687 As we saw above, REF-walk holds a counted reference to the current
691 taken to prevent certain changes from happening. RCU-walk must not
696 To preserve the invariant mentioned above (that RCU-walk may only make
697 decisions that REF-walk could have made), it must make the checks at
698 or near the same places that REF-walk holds the references. So, when
699 REF-walk increments a reference count or takes a spinlock, RCU-walk
701 similar function. When REF-walk decrements the count or drops the
702 lock, RCU-walk checks if the sampled status is still valid using
706 RCU-walk accesses two different fields in a seqlock-protected
709 is needed - which it usually is - RCU-walk must take a copy and then
713 imposes a memory barrier so that no memory-read instruction from
717 byte-wise name equality, calls into the filesystem to compare a name
720 are consistent, and only then is ``->d_compare()`` called. When
728 the bigger picture of how RCU-walk uses seqlocks.
730 ``mount_lock`` and ``nd->m_seq``
733 We already met the ``mount_lock`` seqlock when REF-walk used it to
734 ensure that crossing a mount point is performed safely. RCU-walk uses
738 descends the tree, RCU-walk samples the state of ``mount_lock`` at the
739 start of the walk and stores this initial sequence number in the
743 relatively rare, it is reasonable to fall back on REF-walk any time
746 ``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
747 sequence, whether switching to REF-walk for the rest of the path or
751 whole RCU-walk sequence is aborted and the path is processed again by
752 REF-walk.
754 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
755 that, had REF-walk taken counted references on each vfsmount, the
759 ``dentry->d_seq`` and ``nd->seq``
762 In place of taking a count or lock on ``d_reflock``, RCU-walk samples
763 the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
764 ``seq`` field of the nameidata structure, so ``nd->seq`` should always be
765 the current sequence number of ``nd->dentry``. This number needs to be
775 ``mnt->mnt_mountpoint`` link to get a new dentry and collect its
778 mount point and follow the ``mnt->mnt_root`` link. This would imply a
783 The inode pointer, stored in ``->d_inode``, is a little more
791 ``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
804 the old dentry which we saw in REF-walk.
806 No ``inode->i_rwsem`` or even ``rename_lock``
811 ``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
812 take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
813 to notice, the result will be either that RCU-walk fails to find the
816 REF-walk mode which can take whatever locks are needed.
818 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
819 any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
822 something that actually is there. When RCU-walk fails to find
824 already drops down to REF-walk and tries again with appropriate
829 -----------------------------------------
831 That "dropping down to REF-walk" typically involves a call to
832 ``unlazy_walk()``, so named because "RCU-walk" is also sometimes
844 Other reasons for dropping out of RCU-walk that do not trigger a call
848 will return ``-ECHILD`` which will percolate up until it triggers a new
849 attempt from the top using REF-walk.
855 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
862 all. For ``dentry->d_lockref``, it is safe to increment the reference
864 "dead" which involves setting the counter to ``-128``.
867 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 the standard way of calling ``mnt_put()`` - an unmount may have
878 --------------------------
880 RCU-walk depends almost entirely on cached information and often will
882 besides the already-mentioned component-name comparison, where the
883 file system might be included in RCU-walk, and it must know to be
886 If the filesystem has non-standard permission-checking requirements -
888 - the ``i_op->permission`` interface might be called during RCU-walk.
890 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
891 promptly. ``i_op->permission`` is given the inode pointer, not the
901 ``d_op->d_revalidate`` may be called in RCU-walk too. This interface
910 ------------------
912 In various places in the details of REF-walk and RCU-walk, and also in
917 can see that in the high-level approach of first trying RCU-walk and
918 then trying REF-walk, and in places where ``unlazy_walk()`` is used to
919 switch to REF-walk for the rest of the path. We also saw it earlier
920 in ``dget_parent()`` when following a "``..``" link. It tries a quick way
924 again - repeatedly". This is seen with the use of ``rename_lock`` and
925 ``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
944 Then a consideration of access-time updates and summary of the various
948 -----------------
958 a component name refers to a symbolic link, then that component is
959 replaced by the body of the link and, if that body starts with a '/',
961 "``readlink -f``" command does, though it also edits out "``.``" and
978 successfully - the error ``ELOOP`` must be returned. Loops can be
987 true loops, but also to "very deep" non-loops. It's not about memory
1012 ---------------------------------------
1016 to external storage. It is particularly important for RCU-walk to be
1018 it doesn't need to drop down into REF-walk.
1020 .. _object-oriented design pattern: https://lwn.net/Articles/446317/
1026 common `object-oriented design pattern`_ in the kernel). This will
1053 Taking a reference to a cache page is often possible even in RCU-walk
1056 out of RCU-walk mode completely. Even filesystems that allocate
1058 allocate memory without the need to drop out of RCU-walk. If a
1059 filesystem cannot successfully get a reference in RCU-walk mode, it
1060 must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1061 REF-walk mode in which the filesystem is allowed to sleep.
1063 The place for all this to happen is the ``i_op->get_link()`` inode
1064 method. This is called both in RCU-walk and REF-walk. In RCU-walk the
1065 ``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
1066 RCU-walk. Much like the ``i_op->permission()`` method we
1067 looked at previously, ``->get_link()`` would need to be careful that
1070 ``struct delayed_called`` will be passed to ``->get_link()``:
1072 set_delayed_call(). Later on, when VFS wants to put link, it will call
1075 In order for the reference to each symlink to be dropped when the walk completes,
1076 whether in RCU-walk or REF-walk, the symlink stack needs to contain,
1079 - the ``struct path`` to provide a reference to the previous path
1080 - the ``const char *`` to provide a reference to the to previous name
1081 - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1082 - the ``struct delayed_call`` for later invocation.
1086 remnant). On a 64-bit system, this is about 40 bytes per entry;
1096 ---------------------
1099 components in the path and all of the non-final symlinks. As symlinks
1107 which returns the link from the filesystem.
1116 want to pop the symlink-just-completed off the stack before pushing
1117 the symlink-just-found to avoid leaving empty path remnants that would
1137 A pair of special-case symlinks deserve a little further explanation.
1149 aren't really (and are therefore commonly referred to as "magic-links")::
1151 $ ls -l /proc/self/fd/1
1152 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1157 objects you get a name that might refer to the same file - unless it
1159 one of these, the ``->get_link()`` method in "procfs" doesn't return
1161 ``nameidata`` in place to point to that target. ``->get_link()`` then
1166 --------------------------------------------
1203 will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
1208 2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
1209 wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned
1210 otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
1216 ln -s bar /tmp/foo
1221 like for a non-creating open: lookup_last() or open_last_lookup()
1226 ------------------------
1228 We previously said of RCU-walk that it would "take no locks, increment
1251 care about access-time updates during pathname lookup.
1256 filesystem, at least, didn't update atime when following a link.
1260 quite complex. Trying to stay in RCU-walk while doing it is best
1266 ``relatime``, many filesystems record ``atime`` with a one-second
1269 It is easy to test if an ``atime`` update is needed while in RCU-walk
1270 mode and, if it isn't, the update can be skipped and RCU-walk mode
1272 path walk drop down to REF-walk. All of this is handled in the
1276 -----------
1294 to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1308 link"). In this case the filesystem has not been asked to revalidate the
1310 to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1311 ``ND_JUMPED`` is set when the look completes - which may be at the
1314 Resolution-restriction flags
1317 In order to allow userspace to protect itself against certain race conditions
1322 ``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1326 ``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1328 ``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1331 bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1332 lookup is determined by the first mountpoint the path lookup reaches --
1333 absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
1334 with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1341 resolution of "..". Magic-links are also blocked.
1345 the starting point, and ".." at the starting point will act as a no-op. As with
1347 attacks against ".." resolution. Magic-links are also blocked.
1349 Final-component flags
1357 point, then the mount is triggered. Some operations would trigger it
1359 needs to trigger the mount but otherwise behaves a lot like ``stat()``, so
1361 "``mount --bind``".
1375 available to the filesystem and particularly the ``->d_revalidate()``
1378 These flags were previously useful for ``->lookup()`` too but with the
1379 introduction of ``->atomic_open()`` they are less relevant there.
1382 ---------------
1385 in good shape - various parts are certainly easier to understand now
1387 "finished". As already mentioned, RCU-walk currently only follows