1Demonstrations of compactstall, the Linux eBPF/bcc version. 2 3 4compactsnoop traces the compact zone system-wide, and print various details. 5Example output (manual trigger by echo 1 > /proc/sys/vm/compact_memory): 6 7# ./compactsnoop 8COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS 9zsh 23685 0 ZONE_DMA -1 SYNC 0.025 complete 10zsh 23685 0 ZONE_DMA32 -1 SYNC 3.925 complete 11zsh 23685 0 ZONE_NORMAL -1 SYNC 113.975 complete 12zsh 23685 1 ZONE_NORMAL -1 SYNC 81.57 complete 13zsh 23685 0 ZONE_DMA -1 SYNC 0.02 complete 14zsh 23685 0 ZONE_DMA32 -1 SYNC 4.631 complete 15zsh 23685 0 ZONE_NORMAL -1 SYNC 113.975 complete 16zsh 23685 1 ZONE_NORMAL -1 SYNC 80.647 complete 17zsh 23685 0 ZONE_DMA -1 SYNC 0.020 complete 18zsh 23685 0 ZONE_DMA32 -1 SYNC 3.367 complete 19zsh 23685 0 ZONE_NORMAL -1 SYNC 115.18 complete 20zsh 23685 1 ZONE_NORMAL -1 SYNC 81.766 complete 21zsh 23685 0 ZONE_DMA -1 SYNC 0.025 complete 22zsh 23685 0 ZONE_DMA32 -1 SYNC 4.346 complete 23zsh 23685 0 ZONE_NORMAL -1 SYNC 114.570 complete 24zsh 23685 1 ZONE_NORMAL -1 SYNC 80.820 complete 25zsh 23685 0 ZONE_DMA -1 SYNC 0.026 complete 26zsh 23685 0 ZONE_DMA32 -1 SYNC 4.611 complete 27zsh 23685 0 ZONE_NORMAL -1 SYNC 113.993 complete 28zsh 23685 1 ZONE_NORMAL -1 SYNC 80.928 complete 29zsh 23685 0 ZONE_DMA -1 SYNC 0.02 complete 30zsh 23685 0 ZONE_DMA32 -1 SYNC 3.889 complete 31zsh 23685 0 ZONE_NORMAL -1 SYNC 113.776 complete 32zsh 23685 1 ZONE_NORMAL -1 SYNC 80.727 complete 33^C 34 35While tracing, the processes alloc pages due to memory fragmentation is too 36serious to meet contiguous memory requirements in the system, compact zone 37events happened, which will increase the waiting delay of the processes. 38 39compactsnoop can be useful for discovering when compact_stall(/proc/vmstat) 40continues to increase, whether it is caused by some critical processes or not. 41 42The STATUS include (CentOS 7.6's kernel) 43 44 compact_status = { 45 # COMPACT_SKIPPED: compaction didn't start as it was not possible or direct reclaim was more suitable 46 0: "skipped", 47 # COMPACT_CONTINUE: compaction should continue to another pageblock 48 1: "continue", 49 # COMPACT_PARTIAL: direct compaction partially compacted a zone and there are suitable pages 50 2: "partial", 51 # COMPACT_COMPLETE: The full zone was compacted 52 3: "complete", 53 } 54 55or (kernel 4.7 and above) 56 57 compact_status = { 58 # COMPACT_NOT_SUITABLE_ZONE: For more detailed tracepoint output - internal to compaction 59 0: "not_suitable_zone", 60 # COMPACT_SKIPPED: compaction didn't start as it was not possible or direct reclaim was more suitable 61 1: "skipped", 62 # COMPACT_DEFERRED: compaction didn't start as it was deferred due to past failures 63 2: "deferred", 64 # COMPACT_NOT_SUITABLE_PAGE: For more detailed tracepoint output - internal to compaction 65 3: "no_suitable_page", 66 # COMPACT_CONTINUE: compaction should continue to another pageblock 67 4: "continue", 68 # COMPACT_COMPLETE: The full zone was compacted scanned but wasn't successful to compact suitable pages. 69 5: "complete", 70 # COMPACT_PARTIAL_SKIPPED: direct compaction has scanned part of the zone but wasn't successful to compact suitable pages. 71 6: "partial_skipped", 72 # COMPACT_CONTENDED: compaction terminated prematurely due to lock contentions 73 7: "contended", 74 # COMPACT_SUCCESS: direct compaction terminated after concluding that the allocation should now succeed 75 8: "success", 76 } 77 78The -p option can be used to filter on a PID, which is filtered in-kernel. Here 79I've used it with -T to print timestamps: 80 81# ./compactsnoop -Tp 24376 82TIME(s) COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS 83101.364115000 zsh 24376 0 ZONE_DMA -1 SYNC 0.025 complete 84101.364555000 zsh 24376 0 ZONE_DMA32 -1 SYNC 3.925 complete 85^C 86 87This shows the zsh process allocs pages, and compact zone events happening, 88and the delays are not affected much. 89 90A maximum tracing duration can be set with the -d option. For example, to trace 91for 2 seconds: 92 93# ./compactsnoop -d 2 94COMM PID NODE ZONE ORDER MODE LAT(ms) STATUS 95zsh 26385 0 ZONE_DMA -1 SYNC 0.025444 complete 96^C 97 98The -e option prints out extra columns 99 100# ./compactsnoop -e 101COMM PID NODE ZONE ORDER MODE FRAGIDX MIN LOW HIGH FREE LAT(ms) STATUS 102summ 28276 1 ZONE_NORMAL 3 ASYNC 0.728 11284 14105 16926 14193 3.58 partial 103summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 14479 0.0 complete 104summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 14785 0.019 complete 105summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 15199 0.006 partial 106summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 17360 0.030 complete 107summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 15443 0.024 complete 108summ 28276 1 ZONE_NORMAL 2 ASYNC -1.000 11284 14105 16926 15634 0.018 complete 109summ 28276 1 ZONE_NORMAL 3 ASYNC 0.832 11284 14105 16926 15301 0.006 partial 110summ 28276 0 ZONE_NORMAL 2 ASYNC -1.000 11043 13803 16564 14774 0.005 partial 111summ 28276 1 ZONE_NORMAL 3 ASYNC 0.733 11284 14105 16926 19888 0.012 partial 112^C 113 114The FRAGIDX is short for fragmentation index, which only makes sense if an 115allocation of a requested size would fail. If that is true, the fragmentation 116index indicates whether external fragmentation or a lack of memory was the 117problem. The value can be used to determine if page reclaim or compaction 118should be used. 119 120Index is between 0 and 1 so return within 3 decimal places 121 1220 => allocation would fail due to lack of memory 1231 => allocation would fail due to fragmentation 124 125We can see the whole buddy's fragmentation index from /sys/kernel/debug/extfrag/extfrag_index 126 127The MIN/LOW/HIGH shows the watermarks of the zone, which can also get from 128/proc/zoneinfo, and FREE means nr_free_pages (can be found in /proc/zoneinfo too). 129 130 131The -K option prints out kernel stack 132 133# ./compactsnoop -K -e 134 135summ 28276 0 ZONE_NORMAL 3 ASYNC 0.528 11043 13803 16564 22654 13.258 partial 136 kretprobe_trampoline+0x0 137 try_to_compact_pages+0x121 138 __alloc_pages_direct_compact+0xac 139 __alloc_pages_slowpath+0x3e9 140 __alloc_pages_nodemask+0x404 141 alloc_pages_current+0x98 142 new_slab+0x2c5 143 ___slab_alloc+0x3ac 144 __slab_alloc+0x40 145 kmem_cache_alloc_node+0x8b 146 copy_process+0x18e 147 do_fork+0x91 148 sys_clone+0x16 149 stub_clone+0x44 150 151summ 28276 1 ZONE_NORMAL 3 ASYNC -1.000 11284 14105 16926 22074 0.008 partial 152 kretprobe_trampoline+0x0 153 try_to_compact_pages+0x121 154 __alloc_pages_direct_compact+0xac 155 __alloc_pages_slowpath+0x3e9 156 __alloc_pages_nodemask+0x404 157 alloc_pages_current+0x98 158 new_slab+0x2c5 159 ___slab_alloc+0x3ac 160 __slab_alloc+0x40 161 kmem_cache_alloc_node+0x8b 162 copy_process+0x18e 163 do_fork+0x91 164 sys_clone+0x16 165 stub_clone+0x44 166 167summ 28276 0 ZONE_NORMAL 3 ASYNC 0.527 11043 13803 16564 25653 9.812 partial 168 kretprobe_trampoline+0x0 169 try_to_compact_pages+0x121 170 __alloc_pages_direct_compact+0xac 171 __alloc_pages_slowpath+0x3e9 172 __alloc_pages_nodemask+0x404 173 alloc_pages_current+0x98 174 new_slab+0x2c5 175 ___slab_alloc+0x3ac 176 __slab_alloc+0x40 177 kmem_cache_alloc_node+0x8b 178 copy_process+0x18e 179 do_fork+0x91 180 sys_clone+0x16 181 stub_clone+0x44 182 183# ./compactsnoop -h 184usage: compactsnoop.py [-h] [-T] [-p PID] [-d DURATION] [-K] [-e] 185 186Trace compact zone 187 188optional arguments: 189 -h, --help show this help message and exit 190 -T, --timestamp include timestamp on output 191 -p PID, --pid PID trace this PID only 192 -d DURATION, --duration DURATION 193 total duration of trace in seconds 194 -K, --kernel-stack output kernel stack trace 195 -e, --extended_fields 196 show system memory state 197 198examples: 199 ./compactsnoop # trace all compact stall 200 ./compactsnoop -T # include timestamps 201 ./compactsnoop -d 10 # trace for 10 seconds only 202 ./compactsnoop -K # output kernel stack trace 203 ./compactsnoop -e # show extended fields 204