.. SPDX-License-Identifier: GPL-2.0
Userland memory ranges are tracked by the kernel under Virtual Memory Areas, or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, described by a :c:struct:`!struct vm_area_struct` object.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
   architectures which use :c:struct:`!vsyscall` and is a global static
   object which does not belong to any specific mm.
31 -------
33 -------
36 on VMA **metadata** so a complicated set of locks are required to ensure memory
43 -----------
* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be stabilised via
  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  as the rmap locks (a usage sketch follows below).
Stabilising a VMA also keeps the address space described by it around.
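As a minimal sketch of the rmap lock case, assuming we already have a
:c:struct:`!struct address_space` and a page offset range of interest in hand
(the variable names here are illustrative, not taken from any specific call
site):

.. code-block:: c

   /* Sketch: stabilise file-backed VMAs mapping the page offset range
    * [pgoff_start, pgoff_end] via the i_mmap rmap lock. */
   struct vm_area_struct *vma;

   i_mmap_lock_read(mapping);
   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
           /* vma is stabilised while the i_mmap lock is held... */
   }
   i_mmap_unlock_read(mapping);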
Lock usage
----------

If you want to **read** VMA metadata, you must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (as sketched after this list), *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.
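A minimal sketch of the VMA read lock approach with mmap read lock fall-back
(:c:member:`!mm` and :c:member:`!addr` are assumed to be in hand; error
handling and surrounding context are elided):

.. code-block:: c

   /* Sketch: optimistically take a VMA read lock, falling back to the
    * mmap read lock if the optimistic attempt fails. */
   struct vm_area_struct *vma = lock_vma_under_rcu(mm, addr);

   if (vma) {
           /* ... read VMA metadata ... */
           vma_end_read(vma);
   } else {
           mmap_read_lock(mm);
           vma = vma_lookup(mm, addr);
           if (vma) {
                   /* ... read VMA metadata ... */
           }
           mmap_read_unlock(mm);
   }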
If you want to **write** VMA metadata, you must do the following (a sketch
follows this list):

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
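Putting these steps together, an illustrative sketch of the write side:

.. code-block:: c

   /* Sketch: modify VMA metadata under both mmap and VMA write locks. */
   mmap_write_lock(mm);

   vma = vma_lookup(mm, addr);
   if (vma) {
           vma_start_write(vma);  /* VMA write lock; no explicit unlock */
           /* ... modify VMA metadata ... */
   }

   mmap_write_unlock(mm);         /* also releases all VMA write locks */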
Note the lock ordering here: you must hold an mmap **write** lock
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

The most significant users of VMA read locks are page fault handlers, which
means that without a VMA write lock, page faults will run concurrent with
whatever you are doing.
The following table summarises which lock combinations keep a VMA stable and
which permit reading or writing VMA metadata:

========= ======== ========= ======= ===== =========== ==========
mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
========= ======== ========= ======= ===== =========== ==========
\-        \-       \-        N       N     N           N
\-        R        \-        Y       Y     N           N
\-        \-       R/W       Y       Y     N           N
R/W       \-/R     \-/R/W    Y       Y     N           N
W         W        \-/R      Y       Y     Y           N
W         W        W         Y       Y     Y           Y
========= ======== ========= ======= ===== =========== ==========
.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
   attempting to do the reverse is invalid as it can result in deadlock - if
   another task already holds an mmap write lock and attempts to acquire a VMA
   write lock, the two tasks deadlock against one another.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.
.. note:: Generally speaking, a read/write semaphore is a class of lock which
   permits concurrent readers. However a write lock can only be obtained once
   all readers have left the critical region (and pending readers are made to
   wait).

   This renders read locks on a read/write semaphore concurrent with other
   readers and write locks exclusive against all others holding the semaphore.
VMA fields
----------

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
   are in effect an internal implementation detail.
.. list-table:: Virtual layout fields
   :header-rows: 1

   * - Field
     - Description
     - Write lock
   * - :c:member:`!vm_start`
     - Inclusive start virtual address of range the VMA describes.
     - mmap write, VMA write.
   * - :c:member:`!vm_end`
     - Exclusive end virtual address of range the VMA describes.
     - mmap write, VMA write.
   * - :c:member:`!vm_pgoff`
     - Describes the page offset into the file, the original page offset
       within the virtual address space (prior to any :c:func:`!mremap`), or
       PFN if a PFN map and the architecture does not support
       :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
     - mmap write, VMA write.
.. list-table:: Core fields
   :header-rows: 1

   * - Field
     - Description
     - Write lock
   * - :c:member:`!vm_mm`
     - Containing mm_struct.
     - None - written once on initial map.
   * - :c:member:`!vm_page_prot`
     - Architecture-specific page table protection bits determined from VMA
       flags.
     - mmap write, VMA write.
   * - :c:member:`!vm_flags`
     - Read-only access to VMA flags describing attributes of the VMA, in
       union with private writable :c:member:`!__vm_flags`.
     - N/A
   * - :c:member:`!__vm_flags`
     - Private, writable access to VMA flags field, updated by
       :c:func:`!vm_flags_*` functions.
     - mmap write, VMA write.
   * - :c:member:`!vm_file`
     - If the VMA is file-backed, points to a :c:struct:`!struct file` object
       describing the underlying file, if anonymous then :c:macro:`!NULL`.
     - None - written once on initial map.
   * - :c:member:`!vm_ops`
     - If the VMA is file-backed, then either the driver or file-system
       provides a :c:struct:`!struct vm_operations_struct` object describing
       callbacks to be invoked on VMA lifetime events.
     - None - Written once on initial map by :c:func:`!f_ops->mmap()`.
   * - :c:member:`!vm_private_data`
     - A :c:member:`!void *` field for driver-specific metadata.
     - Handled by driver.
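For instance, since :c:member:`!vm_flags` is read-only, flag updates must go
through the :c:func:`!vm_flags_*` helpers, which expect the VMA to be
write-locked. A sketch, assuming the mmap write lock is already held and the
specific flags chosen are purely illustrative:

.. code-block:: c

   /* Sketch: update VMA flags via the vm_flags_*() helpers. */
   vma_start_write(vma);           /* mmap write lock already held */
   vm_flags_set(vma, VM_DONTCOPY); /* updates the private __vm_flags */
   vm_flags_clear(vma, VM_LOCKED);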
.. list-table:: Config-specific fields
   :header-rows: 1

   * - Field
     - Configuration option
     - Description
     - Write lock
   * - :c:member:`!anon_name`
     - CONFIG_ANON_VMA_NAME
     - A field for storing a :c:struct:`!struct anon_vma_name` object
       providing a name for anonymous mappings, or :c:macro:`!NULL` if none
       is set or the VMA is file-backed.
     - mmap write, VMA write.
   * - :c:member:`!swap_readahead_info`
     - CONFIG_SWAP
     - Metadata used by the swap mechanism to perform readahead. This field
       is accessed atomically.
     - mmap read, swap-specific lock.
   * - :c:member:`!vm_policy`
     - CONFIG_NUMA
     - :c:type:`!mempolicy` object which describes the NUMA behaviour of the
       VMA.
     - mmap write, VMA write.
   * - :c:member:`!numab_state`
     - CONFIG_NUMA_BALANCING
     - :c:type:`!vma_numab_state` object which describes the current state of
       NUMA balancing in relation to this VMA, updated under mmap read lock by
       :c:func:`!task_numa_work`.
     - mmap read, numab-specific lock.
   * - :c:member:`!vm_userfaultfd_ctx`
     - CONFIG_USERFAULTFD
     - Userfaultfd context wrapper object of type
       :c:type:`!vm_userfaultfd_ctx`, either of zero size if userfaultfd is
       disabled, or containing a pointer to an underlying
       :c:type:`!userfaultfd_ctx` object which describes userfaultfd metadata.
     - mmap write, VMA write.
.. list-table:: Reverse mapping fields
   :header-rows: 1

   * - Field
     - Description
     - Write lock
   * - :c:member:`!shared.rb`
     - A red/black tree node used, if the mapping is file-backed, to place
       the VMA in the
       :c:member:`!struct address_space->i_mmap` red/black interval tree.
     - mmap write, VMA write, i_mmap write.
   * - :c:member:`!shared.rb_subtree_last`
     - Metadata used for management of the interval tree if the VMA is
       file-backed.
     - mmap write, VMA write, i_mmap write.
   * - :c:member:`!anon_vma_chain`
     - List of pointers to both forked/CoW’d :c:type:`!anon_vma` objects and
       :c:member:`!vma->anon_vma` if it is non-:c:macro:`!NULL`.
     - mmap read, anon_vma write.
   * - :c:member:`!anon_vma`
     - :c:type:`!anon_vma` object used by anonymous folios mapped exclusively
       to this VMA. Initially set by :c:func:`!anon_vma_prepare` serialised
       by the :c:macro:`!page_table_lock`. This is set as soon as any page is
       faulted in.
     - When :c:macro:`!NULL` and setting non-:c:macro:`!NULL`: mmap read,
       page_table_lock. When non-:c:macro:`!NULL` and setting
       :c:macro:`!NULL`: mmap write, VMA write, anon_vma write.
These fields are used in conjunction with the reverse mapping and, for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
   then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
   trees at the same time, so all of these fields might be utilised at once.
Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), with the leaf level containing either the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.
There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`).
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock.
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
   (an example follows this list).
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries.
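As an example of zapping, a truncation-like path might clear all leaf entries
mapping a hole in a file as follows (a sketch; :c:member:`!mapping`,
:c:member:`!holebegin` and :c:member:`!holelen` are assumed to be in hand):

.. code-block:: c

   /* Sketch: zap all leaf entries mapping the file range [holebegin,
    * holebegin + holelen), leaving the page tables themselves intact. */
   unmap_mapping_range(mapping, holebegin, holelen, /* even_cows */0);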
**Traversing** and **zapping** ranges can be performed while holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
as far as these operations are concerned (though at the leaf level, page table
locks may additionally be required to
serialise - see the page table implementation detail section for more details).

.. note:: Page tables which are not backed by a VMA, such as kernel mappings,
   are subject to different rules.
   See :c:func:`!walk_page_range_novma` for details.

**Freeing** page tables is subject to far stricter requirements, explored in
detail below.
The :c:func:`!free_pgtables` function removes the relevant VMAs
from the reverse mappings as part of this teardown.
Lock ordering
-------------

Lock ordering rules exist because, without them, two tasks might each acquire
locks in a different order
but in doing so inadvertently cause a mutual deadlock.

For example, consider thread 1 which holds lock A and tries to acquire lock B,
while thread 2 holds lock B and tries to acquire lock A. Neither thread can
make progress, and both are deadlocked.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:
.. code-block::

   inode->i_rwsem (while writing or truncating, not reading or faulting)
     mm->mmap_lock
       mapping->invalidate_lock (in filemap_fault)
         folio_lock
           hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
             vma_start_write
               mapping->i_mmap_rwsem
                 anon_vma->rwsem
                   mm->page_table_lock or pte_lock
                     swap_lock (in swap_duplicate, swap_info_get)
                       mmlist_lock (in mmput, drain_mmlist and others)
                       mapping->private_lock (in block_dirty_folio)
                         i_pages lock (widely used)
                           lruvec->lru_lock (in folio_lruvec_lock_irq)
                       inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                       bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                         sb_lock (within inode_lock in fs/fs-writeback.c)
                         i_pages lock (widely used, in set_page_dirty,
                                   in arch-dependent flush_dcache_mmap_lock,
                                   within bdi.wb->list_lock in __sync_single_inode)
There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:
.. code-block::

   ->i_mmap_rwsem (truncate_pagecache)
     ->private_lock (__free_pte->block_dirty_folio)
       ->swap_lock (exclusive_swap_page, others)
         ->i_pages lock

   ->i_rwsem
     ->invalidate_lock (acquired by fs in truncate path)
       ->i_mmap_rwsem (truncate->unmap_mapping_range)

   ->mmap_lock
     ->i_mmap_rwsem
       ->page_table_lock or pte_lock (various, mainly in memory.c)
         ->i_pages lock (arch-dependent flush_dcache_mmap_lock)

   ->mmap_lock
     ->invalidate_lock (filemap_fault)
       ->lock_page (filemap_fault, access_process_vm)

   ->i_rwsem (generic_perform_write)
     ->mmap_lock (fault_in_readable->do_page_fault)

   bdi->wb.list_lock
     sb_lock (fs/fs-writeback.c)
     ->i_pages lock (__sync_single_inode)

   ->i_mmap_rwsem
     ->anon_vma.lock (vma_merge)

   ->anon_vma.lock
     ->page_table_lock or pte_lock (anon_vma_prepare and various)

   ->page_table_lock or pte_lock
     ->swap_lock (try_to_unmap_one)
     ->private_lock (try_to_unmap_one)
     ->i_pages lock (try_to_unmap_one)
     ->lruvec->lru_lock (follow_page_mask->mark_page_accessed)
     ->lruvec->lru_lock (check_pte_range->folio_isolate_lru)
     ->private_lock (folio_remove_rmap_pte->set_page_dirty)
     ->i_pages lock (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty)
       ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock (zap_pte_range->set_page_dirty)
       ->inode->i_lock (zap_pte_range->set_page_dirty)
     ->private_lock (zap_pte_range->block_dirty_folio)
------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
   locking rules for page tables at other levels.

Page table locking details
--------------------------
In addition to the locks described in the terminology section above, there are
locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.
* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock` (a usage sketch
  follows this list), however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.
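For example, the PMD-level lock follows the usual spinlock pattern (a sketch,
assuming the mm and PMD pointer are already in hand):

.. code-block:: c

   /* Sketch: take the (possibly split) PMD lock to modify a PMD entry. */
   spinlock_t *ptl = pmd_lock(mm, pmd);

   /* ... read/modify the PMD entry while the lock is held ... */

   spin_unlock(ptl);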
Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

The most fundamental rule when interacting with page tables is:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
.. note:: When operations such as VMA teardown remove
   VMAs, :c:func:`!vms_clear_ptes` has a window of time between
   zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
   :c:func:`!free_pgtables`), where the VMA is still visible in the
   rmap tree. :c:func:`!free_pgtables` assumes that the zap has
   already been performed, so nothing may be permitted to install page
   table entries within this window.
PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel address space to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.

So accessing PTE-level page tables requires at least holding an RCU read lock;
but that only suffices for readers that can tolerate racing with concurrent
page table updates such that an empty PTE is observed (in a page table that
has actually already been detached and marked for RCU freeing, while a new page
table has been installed in the same location).

Writers normally need to take the PTE lock and revalidate that the
PMD entry still refers to the same PTE-level page table.
If the writer does not care whether it is the same PTE-level page table, it
can take the PMD lock and revalidate that the contents of the pmd entry still meet
the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
when handling :c:macro:`!MADV_COLLAPSE`.
To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel address space if required, take the RCU
lock, and depending on the helper, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.
Page table entries may be accessed concurrently - the hardware can set
accessed/dirty bits at any time, and
functionality like GUP-fast locklessly traverses (that is reads) page tables,
so reads and writes of entries must be appropriately atomic.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).
If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

When page table entries are read, they must be read once and only once and in
a manner that ensures the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must do so with a read-modify-write
operation as, for example, in :c:func:`!ptep_get_and_clear`.
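A sketch contrasting the two (the surrounding context, including the
:c:member:`!pte`, :c:member:`!mm` and :c:member:`!addr` variables, is
assumed):

.. code-block:: c

   /* Sketch: a single atomic read of a PTE... */
   pte_t entry = ptep_get(pte); /* READ_ONCE() under the hood */

   if (pte_present(entry)) {
           /* ...versus an atomic read-modify-write, used when we care
            * about the previously stored data. */
           entry = ptep_get_and_clear(mm, addr, pte);
   }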
Lockless page table traversals, most notably
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, the clearing of page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.
When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
   references the :c:member:`!mm->page_table_lock`.
Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.
Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.
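A typical usage pattern, sketched below (the function name and range variables
are illustrative; note the :c:macro:`!NULL` check, as the PMD entry may have
changed from under us):

.. code-block:: c

   /* Sketch: walk PTEs in [addr, end) under the PTE-level lock. */
   static int walk_ptes(struct mm_struct *mm, pmd_t *pmd,
                        unsigned long addr, unsigned long end)
   {
           spinlock_t *ptl;
           pte_t *start_pte, *pte;

           start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
           if (!start_pte)
                   return -EAGAIN; /* PMD changed under us; retry */

           for (pte = start_pte; addr < end; pte++, addr += PAGE_SIZE) {
                   pte_t entry = ptep_get(pte);

                   if (pte_none(entry))
                           continue; /* skip empty entries */
                   /* ... examine or modify entry under the PTL ... */
           }

           pte_unmap_unlock(start_pte, ptl);
           return 0;
   }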
.. note:: There are some variants on this, such as
   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
   for brevity we do not explore this. See the comment on
   :c:func:`!__pte_offset_map_lock` for more details.
A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the
table above is empty and, only if so, to acquire the page table lock and check
again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`, sketched below.
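A simplified sketch of this pattern, modelled on :c:func:`!__pud_alloc`
(the function name is illustrative; accounting and memory barriers are
elided):

.. code-block:: c

   /* Sketch: optimistically allocate a PUD table, then re-check under
    * the lock whether somebody else already installed one. */
   static int pud_install_sketch(struct mm_struct *mm, p4d_t *p4d,
                                 unsigned long address)
   {
           pud_t *new = pud_alloc_one(mm, address);

           if (!new)
                   return -ENOMEM;

           spin_lock(&mm->page_table_lock);
           if (!p4d_present(*p4d))
                   p4d_populate(mm, p4d, new); /* still empty: install */
           else
                   pud_free(mm, new); /* lost the race: discard ours */
           spin_unlock(&mm->page_table_lock);
           return 0;
   }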
In the very common case of a PTE entry, however, this pattern is more complicated,
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure such a change is detected and the operation safely
retried.
When freeing page tables, it is insufficient to simply hold an mmap write lock
and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.
The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.
Page tables above the PTE level are then detached, with their entries
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions), as at this point
nothing else can reach them.

.. note:: It is possible for leaf page tables to be torn down independently of
   the page tables above them, as is done by
   :c:func:`!retract_page_tables`, which is performed under the i_mmap
   read lock with the relevant page table locks held.
Some operations go beyond installing, zapping and freeing and instead **move**
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving page tables at higher levels in bulk; in such cases the rmap locks may
additionally be required.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
VMA lock internals
------------------

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.
A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.
VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, and releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.

Note that a semaphore write lock is not held across a VMA lock. Rather, a
sequence number is used for serialisation, and the write semaphore is only
acquired at the point of write lock to update this.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.
The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
read/write semaphore and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.
Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, and the RCU critical section is in any case
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.
Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`; however, the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, it is not.
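Conceptually, and simplifying away memory ordering and the precise field types
used by the kernel, the check looks like this:

.. code-block:: c

   /* Conceptual sketch only - not the kernel's actual implementation. */
   static bool vma_is_write_locked_sketch(struct vm_area_struct *vma)
   {
           /* Equal sequence counts indicate a write-locked VMA. */
           return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
   }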
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated once the mmap write lock is released, and that
all VMA write locks are released at once when this occurs.
Each time a VMA read lock is acquired, we acquire a read lock on the
:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm; if the counts
match, the VMA is write-locked and the read lock attempt fails.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the VMA lookup itself is safe against concurrent
modification of the tree.
On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
read/write semaphore before setting the VMA's sequence number, all while
holding the mmap write lock.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After the sequence number is set, the semaphore write lock is released, avoiding
complexity with a long-term held write lock.

This clever combination of a read/write semaphore and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere too).
mmap write lock downgrading
---------------------------

While the mmap write lock is held, the holder has exclusive access to the
virtual address space layout (subject to the VMA read lock caveats above).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).
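A sketch of the typical downgrade pattern (illustrative only):

.. code-block:: c

   /* Sketch: modify exclusively, then downgrade to permit readers while
    * keeping the address space layout stable. */
   mmap_write_lock(mm);
   /* ... modify VMAs, taking VMA write locks as needed ... */
   mmap_write_downgrade(mm); /* VMA write locks end via vma_end_write_all() */
   /* ... read-only work; no new writer can intervene ... */
   mmap_read_unlock(mm);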
.. list-table:: Lock exclusivity
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y
Here a Y indicates the locks in the matching row/column are mutually exclusive,
and an N indicates that they are not.
Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.