1 .. SPDX-License-Identifier: GPL-2.0
9 :Authors: - Fenghua Yu <[email protected]>
10 - Tony Luck <[email protected]>
11 - Vikas Shivappa <[email protected]>
38 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
57 pseudo-locking is a unique way of using cache control to "pin" or
59 "Cache Pseudo-Locking".
96 own settings for cache use which can override
114 "shareable_bits" but no resource group will
120 well as a resource group's allocation.
126 one resource group. No sharing allowed.
128 Corresponding region is pseudo-locked. No
131 Indicates whether non-contiguous 1s values in the CBM are supported.
136 Non-contiguous 1s values in the CBM are supported.
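A sketch of checking this property before writing a non-contiguous mask
(the resource name "L3" is illustrative)::

  # cat /sys/fs/resctrl/info/L3/sparse_masks
  1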
155 non-linear. This field is purely informational
166 "per-thread":
216 5 Reads to slow memory in the non-local NUMA domain
218 3 Non-temporal writes to non-local NUMA domain
219 2 Non-temporal writes to local NUMA domain
220 1 Reads to memory in the non-local NUMA domain
262 counter can be considered for re-use.
275 mask f7 has non-consecutive 1-bits
281 system. The default group is the root directory which, immediately
293 group that is their ancestor. These are called "MON" groups in the rest
296 Removing a directory will move all tasks and cpus owned by the group it
300 Moving MON group directories to a new parent CTRL_MON group is supported
301 for the purpose of changing the resource allocations of a MON group
305 MON group.
311 this group. Writing a task id to the file will add a task to the
312 group. Multiple tasks can be added by separating the task ids
316 already added tasks before the failure will remain in the group.
319 If the group is a CTRL_MON group the task is removed from
320 whichever previous CTRL_MON group owned the task and also from
321 any MON group that owned the task. If the group is a MON group,
323 group. The task is removed from any previous MON group.
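For example, a minimal sketch of moving a task into a group (the group
name "g1" and task id 1234 are illustrative)::

  # mkdir /sys/fs/resctrl/g1
  # echo 1234 > /sys/fs/resctrl/g1/tasks
  # grep 1234 /sys/fs/resctrl/g1/tasks
  1234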
328 this group. Writing a mask to this file will add and remove
329 CPUs to/from this group. As with the tasks file a hierarchy is
331 parent CTRL_MON group.
332 When the resource group is in pseudo-locked mode this file will
334 pseudo-locked region.
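For example, a sketch of assigning CPUs 4-7 with a hex mask (the group
name and mask are illustrative)::

  # echo f0 > /sys/fs/resctrl/g1/cpus
  # cat /sys/fs/resctrl/g1/cpus_list
  4-7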
344 A list of all the resources available to this group.
345 Each resource has its own line and format - see below for details.
353 The "mode" of the resource group dictates the sharing of its
354 allocations. A "shareable" resource group allows sharing of its
355 allocations while an "exclusive" resource group does not. A
356 cache pseudo-locked region is created by first writing
357 "pseudo-locksetup" to the "mode" file before writing the cache
358 pseudo-locked region's schemata to the resource group's "schemata"
359 file. On successful pseudo-locked region creation the mode will
360 automatically change to "pseudo-locked".
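For example, a sketch of making an existing group's allocations exclusive
(the write succeeds only if the group's schemata does not overlap with any
other group; "g1" is illustrative)::

  # echo exclusive > /sys/fs/resctrl/g1/mode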
364 for the control group. On x86 this is the CLOSID.
372 directories have one file per event (e.g. "llc_occupancy",
373 "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
375 all tasks in the group. In CTRL_MON groups these files provide
376 the sum for all tasks in the CTRL_MON group and all tasks in
378 On systems with Sub-NUMA Cluster (SNC) enabled there are extra
385 for the monitor group. On x86 this is the RMID.
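For example, a sketch of reading one event for one L3 cache domain (the
group name, domain and value are illustrative)::

  # cat /sys/fs/resctrl/g1/mon_data/mon_L3_00/llc_occupancy
  16234496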
398 -------------------------
403 1) If the task is a member of a non-default group, then the schemata
404 for that group is used.
406 2) Else if the task belongs to the default group, but is running on a
407 CPU that is assigned to some specific group, then the schemata for the
408 CPU's group is used.
410 3) Otherwise the schemata for the default group is used.
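For example (a sketch; the group name and mask are illustrative), after::

  # echo 10 > /sys/fs/resctrl/g1/cpus

a default-group task running on CPU 4 is allocated according to "g1"'s
schemata by rule 2, while a task that is itself a member of "g1" uses that
group's schemata wherever it runs by rule 1.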
413 -------------------------
414 1) If a task is a member of a MON group, or a non-default CTRL_MON group,
415 then RDT events for the task will be reported in that group.
417 2) If a task is a member of the default CTRL_MON group, but is running
418 on a CPU that is assigned to some specific group, then the RDT events
419 for the task will be reported in that group.
422 "mon_data" group.
427 When moving a task from one group to another you should remember that
429 a task in a monitor group showing 3 MB of cache occupancy. If you move
430 the task to a new group and immediately check the occupancy of the old and new
431 groups you will likely see that the old group is still showing 3 MB and
432 the new group zero. When the task accesses locations still in cache from
434 you will likely see the occupancy in the old group go down as cache lines
435 are evicted and re-used while the occupancy in the new group rises as
437 membership in the new group.
439 The same applies to cache allocation control. Moving a task to a group
444 to identify a control group and a monitoring group respectively. Each of
445 the resource groups is mapped to these IDs based on the kind of group. The
448 and creation of a "MON" group may fail if we run out of RMIDs.
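When the IDs are exhausted, directory creation fails. A sketch of what this
typically looks like (resctrl returns -ENOSPC here; the group name is
illustrative)::

  # mkdir /sys/fs/resctrl/g42
  mkdir: cannot create directory '/sys/fs/resctrl/g42': No space left on device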
450 max_threshold_occupancy - generic concepts
451 ------------------------------------------
457 limbo RMIDs which are not yet ready to be used, the user may see an -EBUSY
466 to attempt to create an empty monitor group to force an update. Output may
467 only be produced if creation of a control or monitor group fails.
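The threshold itself can be adjusted through the info directory; a sketch
(the value is in bytes and illustrative)::

  # echo 65536 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy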
469 Schemata files - general concepts
470 ---------------------------------
476 ---------
477 On current generation systems there is one L3 cache per socket and L2
488 ---------------------
495 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
497 to see whether non-contiguous 1s values are supported. On a system with a 20-bit mask
501 Notes on Sub-NUMA Cluster mode
503 When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
505 on Sub-NUMA nodes share the same L3 cache and the system may report
506 the NUMA distance between Sub-NUMA nodes with a lower value than used
509 The top-level monitoring files in each "mon_L3_XX" directory provide
511 Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
520 of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
522 with two SNC nodes per L3 cache, each bit only represents 5MB.
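On such a system the per-SNC-node data appears in sub-directories of each
"mon_L3_XX" directory; a sketch of the layout (the "mon_sub_L3_YY" names
and file listing are assumptions for illustration)::

  # ls /sys/fs/resctrl/mon_data/mon_L3_00/
  llc_occupancy  mbm_local_bytes  mbm_total_bytes  mon_sub_L3_00  mon_sub_L3_01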
585 ----------------------------------------------------------------
591 ------------------------------------------------------------------
599 ------------------------
612 ------------------------------------------
620 ----------------------------------------------
628 ---------------------------------------
647 ---------------------------------
662 --------------------------------------------------
682 --------------------------------------------------------------------
701 Cache Pseudo-Locking
704 application can fill. Cache pseudo-locking builds on the fact that a
705 CPU can still read and write data pre-allocated outside its current
706 allocated area on a cache hit. With cache pseudo-locking, data can be
709 pseudo-locked memory is made accessible to user space where an
713 The creation of a cache pseudo-locked region is triggered by a request
715 to be pseudo-locked. The cache pseudo-locked region is created as follows:
717 - Create a CAT allocation CLOSNEW with a CBM matching the schemata
718 from the user of the cache region that will contain the pseudo-locked
721 while the pseudo-locked region exists.
722 - Create a contiguous region of memory of the same size as the cache
724 - Flush the cache, disable hardware prefetchers, disable preemption.
725 - Make CLOSNEW the active CLOS and touch the allocated memory to load
727 - Set the previous CLOS as active.
728 - At this point the closid CLOSNEW can be released - the cache
729 pseudo-locked region is protected as long as its CBM does not appear in
730 any CAT allocation. Even though the cache pseudo-locked region will from
732 any CLOS will be able to access the memory in the pseudo-locked region since
734 - The contiguous region of memory loaded into the cache is exposed to
735 user-space as a character device.
737 Cache pseudo-locking increases the probability that data will remain
741 “locked” data from cache. Power management C-states may shrink or
742 power off cache. Deeper C-states will automatically be restricted on
743 pseudo-locked region creation.
745 It is required that an application using a pseudo-locked region runs
747 with the cache on which the pseudo-locked region resides. A sanity check
748 within the code will not allow an application to map pseudo-locked memory
750 pseudo-locked region resides. The sanity check is only done during the
754 Pseudo-locking is accomplished in two stages:
757 of cache that should be dedicated to pseudo-locking. At this time an
760 2) During the second stage a user-space application maps (mmap()) the
761 pseudo-locked memory into its address space.
763 Cache Pseudo-Locking Interface
764 ------------------------------
765 A pseudo-locked region is created using the resctrl interface as follows:
767 1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
768 2) Change the new resource group's mode to "pseudo-locksetup" by writing
769 "pseudo-locksetup" to the "mode" file.
770 3) Write the schemata of the pseudo-locked region to the "schemata" file. All
774 On successful pseudo-locked region creation the "mode" file will contain
775 "pseudo-locked" and a new character device with the same name as the resource
776 group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
777 by user space in order to obtain access to the pseudo-locked memory region.
779 An example of cache pseudo-locked region creation and usage can be found below.
781 Cache Pseudo-Locking Debugging Interface
782 ----------------------------------------
783 The pseudo-locking debugging interface is enabled by default (if
787 location is present in the cache. The pseudo-locking debugging interface uses
789 the pseudo-locked region:
793 example below). In this test the pseudo-locked region is traversed at
801 When a pseudo-locked region is created a new debugfs directory is created for
803 write-only file, pseudo_lock_measure, is present in this directory. The
804 measurement of the pseudo-locked region depends on the number written to this
825 In this example a pseudo-locked region named "newlock" was created. Here is
831 # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
839 # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
859 In this example a pseudo-locked region named "newlock" was created on the L2
872 # _-----=> irqs-off
873 # / _----=> need-resched
874 # | / _---=> hardirq/softirq
875 # || / _--=> preempt-depth
877 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
879 pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
887 On a two socket machine (one L3 cache per socket) with just four bits
892 # mount -t resctrl resctrl /sys/fs/resctrl
898 The default resource group is unmodified, so we have access to all parts
901 Tasks that are under the control of group "p0" may only allocate from the
903 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
905 Similarly, tasks that are under the control of group "p0" may use a
907 Tasks in group "p1" may also use 50% memory b/w on both sockets.
910 b/w that the group may be able to use and the system admin can configure
925 Again two sockets, but this time with a more realistic 20-bit mask.
928 processor 1 on socket 0 of a 2-socket, dual-core machine. To avoid noisy
929 neighbors, each of the two real-time tasks exclusively occupies one quarter
933 # mount -t resctrl resctrl /sys/fs/resctrl
936 First we reset the schemata for the default group so that the "upper"
942 Next we make a resource group for our first real-time task and give
949 Finally we move our first real-time task into this resource group. We
956 # taskset -cp 1 1234
963 # taskset -cp 2 5678
972 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
978 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
982 A single socket system which has real-time tasks running on cores 4-7 and
983 a non real-time workload assigned to cores 0-3. The real-time tasks share text
984 and data, so a per-task association is not required and due to interaction
989 # mount -t resctrl resctrl /sys/fs/resctrl
992 First we reset the schemata for the default group so that the "upper"
998 Next we make a resource group for our real-time cores and give it access
1006 Finally we move cores 4-7 over to the new group and make sure that the
1008 also get 50% of memory bandwidth assuming that cores 4-7 are SMT
1009 siblings and only the real-time threads are scheduled on cores 4-7.
1017 mode allowing sharing of their cache allocations. If one resource group
1018 configures a cache allocation then nothing prevents another resource group
1021 In this example a new exclusive resource group will be created on an L2 CAT
1022 system with two L2 cache instances that can be configured with an 8-bit
1023 capacity bitmask. The new exclusive resource group will be configured to use
1027 # mount -t resctrl resctrl /sys/fs/resctrl/
1030 First, we observe that the default group is configured to allocate to all L2
1036 We could attempt to create the new resource group at this point, but it will
1037 fail because of the overlap with the schemata of the default group::
1044 -sh: echo: write error: Invalid argument
1048 To ensure that there is no overlap with another resource group the default
1049 resource group's schemata has to change, making it possible for the new
1050 resource group to become exclusive.
1061 A newly created resource group will not overlap with an exclusive resource
1062 group::
1076 A resource group cannot be forced to overlap with an exclusive resource group::
1079 -sh: echo: write error: Invalid argument
1081 overlaps with exclusive group
1083 Example of Cache Pseudo-Locking
1085 Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The pseudo-locked
1090 # mount -t resctrl resctrl /sys/fs/resctrl/
1093 Ensure that there are bits available that can be pseudo-locked. Since only
1094 unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
1095 removed from the default resource group's schemata::
1103 Create a new resource group that will be associated with the pseudo-locked
1104 region, indicate that it will be used for a pseudo-locked region, and
1105 configure the requested pseudo-locked region capacity bitmask::
1108 # echo pseudo-locksetup > newlock/mode
1111 On success the resource group's mode will change to pseudo-locked, the
1112 bit_usage will reflect the pseudo-locked region, and the character device
1113 exposing the pseudo-locked region will exist::
1116 pseudo-locked
1119 # ls -l /dev/pseudo_lock/newlock
1120 crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
1125 * Example code to access one page of pseudo-locked cache region
1138 * cores associated with the pseudo-locked region. Here the cpu
1175 /* Application interacts with pseudo-locked memory @mapping */
1189 ----------------------------
1197 1. Read the CBM masks from each directory or the per-resource "bit_usage"
1228 $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
1232 $ cat create-dir.sh
1234 mask = function-of(output.txt)
1238 $ flock /sys/fs/resctrl/ ./create-dir.sh
1257 exit(-1);
1269 exit(-1);
1281 exit(-1);
1290 if (fd == -1) {
1292 exit(-1);
1306 ----------------------
1309 group or CTRL_MON group.
1312 Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
1313 ------------------------------------------------------------------------
1314 On a two socket machine (one L3 cache per socket) with just four bits
1317 # mount -t resctrl resctrl /sys/fs/resctrl
1325 The default resource group is unmodified, so we have access to all parts
1328 Tasks that are under the control of group "p0" may only allocate from the
1330 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
1332 Create monitor groups and assign a subset of tasks to each monitor group.
1350 The parent CTRL_MON group shows the aggregated data.
1357 --------------------------------------------
1358 On a two socket machine (one L3 cache per socket)::
1360 # mount -t resctrl resctrl /sys/fs/resctrl
1364 An RMID is allocated to the group once it is created and hence the <cmd>
1377 ---------------------------------------------------------------------
1381 But the user can create different MON groups within the root group, thereby
1388 # mount -t resctrl resctrl /sys/fs/resctrl
1396 Monitor the groups separately and also get per-domain data. From the
1412 -----------------------------------
1414 A single socket system which has real-time tasks running on cores 4-7
1419 # mount -t resctrl resctrl /sys/fs/resctrl
1423 Move CPUs 4-7 over to p1::
1436 -----------------------------------------------------------------