xref: /aosp_15_r20/art/runtime/thread_suspension_timeouts.md (revision 795d594fd825385562da6b089ea9b2033f3abf5a)
Thread Suspension timeouts in ART
---------------------------------
ART occasionally needs to "suspend" threads for a variety of reasons. "Suspended" threads may
continue to run, but may not access data structures related to the Java heap. Please see
`mutator_gc_coord.md` for details.

The suspension process usually involves setting a flag for the thread to be "suspended", possibly
causing the thread being "suspended" to generate a SIGSEGV at an opportune point, so that it
notices the flag, and then having it acknowledge that it is now "suspended".

This process is time-limited so that it does not hang a misbehaving process indefinitely. A
timeout crashes the process with an abort message indicating a timeout in one of `SuspendAll`,
`SuspendThreadByPeer`, or `SuspendThreadByThreadId`. The timeout normally occurs after 4 seconds
if the thread requesting the suspension has high priority, and after either 8 or 12 seconds
otherwise.

Any such timeout has the inherent downside that it may occur on a sufficiently overcommitted
device even when there is no deadlock or similar bug involved. Clearly this should be
extremely rare.

Android 15 changed the handling of such timeouts in several ways:

1) The underlying suspension code was changed to improve correctness and better report timeouts.
This included reducing the timeout in some cases to avoid the danger of reporting such timeouts as
hard-to-analyze ANRs.

2) When such a timeout is encountered, we now aggressively try to abort the thread refusing to
suspend, so that the main reported stack trace gives a better indication of what went wrong. The
thread originating the suspension request will still abort if this fails or takes too long.

3) The timeout abort message should contain a fair amount of information about the thread failing
to suspend, including two prefixes of the `/proc/<pid>/task/<tid>/stat` file for the offending
thread, taken a second or more apart. These snapshots contain several bits of useful information,
such as kernel process states, the thread priority, and the `utime` and `stime` fields indicating
the amount of time for which the thread was scheduled. See `man proc`, and look for
`/proc/pid/stat`. (Initial Android 15 versions reported `/proc/<tid>/stat` instead, which includes
process rather than thread CPU time.)
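
As a rough illustration of how the two `stat` snapshots might be compared, the following Python
sketch parses a prefix of a `stat` line and reports how much CPU time the thread accumulated
between the snapshots. The helper names (`parse_stat_prefix`, `cpu_ticks_between`) and the sample
lines are made up for illustration. It also assumes that the reported prefix extends at least
through the `nice` field (field 19 in `man proc`), which may not hold for every abort message.

```python
from typing import NamedTuple


class StatSnapshot(NamedTuple):
    state: str     # field 3: kernel state, e.g. R (running/runnable), S (sleeping)
    utime: int     # field 14: user-mode CPU time, in clock ticks
    stime: int     # field 15: kernel-mode CPU time, in clock ticks
    priority: int  # field 18: kernel scheduling priority
    nice: int      # field 19: niceness


def parse_stat_prefix(text: str) -> StatSnapshot:
    # Field 2 (comm) is parenthesized and may itself contain spaces or ')',
    # so split only what follows the last ')'.
    rest = text[text.rindex(")") + 1:].split()
    if len(rest) < 17:
        raise ValueError("stat prefix too short to include the nice field")
    # rest[0] is field 3, so field N lives at rest[N - 3].
    return StatSnapshot(rest[0], int(rest[11]), int(rest[12]),
                        int(rest[15]), int(rest[16]))


def cpu_ticks_between(earlier: StatSnapshot, later: StatSnapshot) -> int:
    # Total user + kernel ticks the thread was charged between the snapshots.
    return (later.utime - earlier.utime) + (later.stime - earlier.stime)


# Hypothetical sample prefixes, not taken from a real abort message.
first = parse_stat_prefix(
    "4321 (Binder:1234_5) R 1 1234 0 0 -1 4194368 120 0 0 0 250 40 0 0 30 10 25 0")
second = parse_stat_prefix(
    "4321 (Binder:1234_5) R 1 1234 0 0 -1 4194368 120 0 0 0 250 40 0 0 30 10 25 0")
print(first.state, second.state, cpu_ticks_between(first, second))  # prints: R R 0
```

A thread that stays in state `R` with no growth in `utime + stime` between the snapshots was
runnable but apparently never scheduled, which may point toward overcommitted cores or a priority
problem rather than a deadlock.
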

Thread suspension has been known to time out for several reasons:

1) A deadlock involving thread suspension. The issues here are discussed in `mutator_gc_coord.md`.
A common cause of these appears to be "native" C++ locks that are both held while executing Java
code, and acquired in `@CriticalNative` or `@FastNative` JNI calls. These are clear bugs that,
once identified, usually have a fairly clear-cut fix.

2) Overcommitting the cores, so that the thread being "suspended" just does not get a chance to
run within the timeout of 4 or more seconds.

3) Either ART or `@CriticalNative`/`@FastNative` code that continues in Java `kRunnable` state for
too long without checking suspension requests.

4) The thread being suspended is either itself running at a low thread priority, or is waiting for
a thread at low thread priority. A Java priority 10 thread has Linux niceness -8, but a priority 1
thread has niceness 20. This means the former gets roughly 1.25^28, or more than 500, times the
CPU share of the latter when the device's cores are overcommitted. It is worth noting that
priority 5 (NORMAL) corresponds to niceness 0, while priority 4 corresponds to niceness 10, which
is already almost a factor of 10 difference.
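
The arithmetic behind these ratios can be checked with a few lines of Python. This is only a
sketch: it assumes the rule of thumb of roughly a 1.25x change in CPU weight per niceness step,
uses the niceness values quoted above, and `cpu_share_ratio` is a hypothetical helper rather than
anything provided by ART or the kernel.

```python
def cpu_share_ratio(nice_favored: int, nice_other: int, step: float = 1.25) -> float:
    # Approximate CPU share of the favored thread relative to the other one
    # when both compete for the same overcommitted core.
    return step ** (nice_other - nice_favored)


# Java priority 10 (niceness -8) versus priority 1 (niceness 20, as quoted above).
max_vs_min = cpu_share_ratio(-8, 20)
# Java priority 5 (niceness 0) versus priority 4 (niceness 10).
normal_vs_4 = cpu_share_ratio(0, 10)

print(f"{max_vs_min:.0f} {normal_vs_4:.1f}")  # prints: 517 9.3
```
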

When we do see such timeouts, they are often caused by a combination of the last three. The fixes
in such a case tend to be less clear. Cores may become significantly overcommitted due to attempts
to avoid unused cores, particularly during startup. There are currently times when ART needs to
perform IO or paging operations while the Java heap is not in a consistent state. Priority issues
can be difficult to address, since temporary priority changes may race with other priority changes.

Different suspension timeout failures will usually need to be addressed individually.
There is no single "silver bullet" fix for all of them. There is ongoing work
to improve the tools available for handling priority issues. Currently the possible fixes
include:

- Remove any newly discovered deadlocks, e.g. by removing an `@FastNative` annotation to prevent
  a lock from being acquired while the thread already has Java heap access. Alternatively, stop
  holding native locks across calls to Java.
- Reduce the amount of time spent continuously in Java runnable state. For application code, that
  may again involve removing `@FastNative` or `@CriticalNative` annotations. For ART internal
  code, break up `ScopedObjectAccess` sections or the like, being careful not to hold native
  pointers to Java heap objects across such sections.
- Avoid excessive parallelism that is causing some threads to starve.
- Reduce differences in thread priorities and, if necessary, avoid very low priority threads, for
  the same reason.
- On slow devices, if you are in a position to do so, consider setting `ro.hw_timeout_multiplier`
  to a value greater than one.
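
To illustrate the first fix in the list above, the deadlock it targets can be modelled as a cycle
in a wait-for graph. The sketch below is a toy model in Python, not ART code: the thread names and
the `find_wait_cycle` helper are hypothetical, and each edge simply records which thread another
thread is waiting on.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    # Follow "X waits for Y" edges from start; return the cycle if one is reached.
    path: List[str] = []
    index_of: Dict[str, int] = {}
    node = start
    while node in waits_for:
        if node in index_of:
            return path[index_of[node]:]
        index_of[node] = len(path)
        path.append(node)
        node = waits_for[node]
    return None


# With @FastNative: the JNI thread blocks on the native lock while still runnable.
with_fast_native = {
    "java_thread": "suspender",           # holds the native lock, suspended until resumed
    "suspender": "fast_native_thread",    # waits for this thread to acknowledge suspension
    "fast_native_thread": "java_thread",  # blocked on the native lock held by java_thread
}
# Without the annotation, the JNI thread counts as suspended while it blocks,
# so the suspender no longer waits for it.
without_fast_native = {
    "java_thread": "suspender",
    "fast_native_thread": "java_thread",
}

print(find_wait_cycle(with_fast_native, "suspender"))
print(find_wait_cycle(without_fast_native, "fast_native_thread"))  # prints: None
```

In this model the first call reports a cycle through all three threads, which corresponds to a
suspension request that can never complete, while the second graph has no cycle.
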