Thread Suspension timeouts in ART
---------------------------------
ART occasionally needs to "suspend" threads for a variety of reasons. "Suspended" threads may
continue to run, but may not access data structures related to the Java heap. Please see
`mutator_gc_coord.md` for details.

The suspension process usually involves setting a flag for the thread to be "suspended", possibly
causing the thread being "suspended" to generate a SIGSEGV at an opportune point, so that it
notices the flag, and then having it acknowledge that it is now "suspended".

This process is time-limited so that it does not hang a misbehaving process indefinitely. A
timeout crashes the process with an abort message indicating a timeout in one of `SuspendAll`,
`SuspendThreadByPeer`, or `SuspendThreadByThreadId`. It will normally occur after 4 seconds if the
thread requesting the suspension has high priority, and either 8 or 12 seconds otherwise.

Any such timeout has the inherent downside that it may occur on a sufficiently overcommitted
device even when there is no deadlock or similar bug involved. Clearly this should be
extremely rare.
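
The flag-and-acknowledge handshake with a time limit can be illustrated with a toy model. This is
only a sketch, not ART's implementation: `SuspendableWorker`, `request_suspend`, and the timing
values are hypothetical names and numbers chosen for illustration, and the sketch returns `False`
on a timeout where ART would abort the process. It does not model the SIGSEGV mechanism.

```python
import threading


class SuspendableWorker:
    """Toy stand-in for a thread that checks a suspend flag between work slices."""

    def __init__(self, work_slice_s):
        self.suspend_requested = threading.Event()  # set by the requester
        self.suspended_ack = threading.Event()      # set by the worker
        self.stop = threading.Event()
        self.work_slice_s = work_slice_s
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout, i.e. after one slice of "work",
        # and True once stop is set, which ends the loop.
        while not self.stop.wait(self.work_slice_s):
            if self.suspend_requested.is_set():
                self.suspended_ack.set()


def request_suspend(worker, timeout_s):
    """Set the flag and wait for the acknowledgement; False means a timeout."""
    worker.suspend_requested.set()
    return worker.suspended_ack.wait(timeout_s)
```

A worker that reaches a check point quickly acknowledges in time, while one that goes a long time
between checks causes `request_suspend` to report a timeout.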

Android 15 changed the handling of such timeouts in several ways:

1) The underlying suspension code was changed to improve correctness and better report timeouts.
This included reducing the timeout in some cases to avoid the danger of reporting such timeouts as
hard-to-analyze ANRs.

2) When such a timeout is encountered, we now aggressively try to abort the thread refusing to
suspend, so that the main reported stack trace gives a better indication of what went wrong. The
thread originating the suspension request will still abort if this failed or took too long.

3) The timeout abort message should contain a fair amount of information about the thread failing
to abort, including two prefixes of the `/proc/<pid>/task/<tid>/stat` for the offending thread,
taken a second or more apart. These snapshots contain several bits of useful information, such as
kernel process states, the thread priority, and the `utime` and `stime` fields indicating the
amount of time for which the thread was scheduled. See `man proc`, and look for `/proc/pid/stat`.
(Initial Android 15 versions reported `/proc/<tid>/stat` instead, which includes process rather
than thread cpu time.)
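
When reading the two stat snapshots by hand, the fields of interest can be pulled out with a small
script. The following is a sketch under stated assumptions: it relies on the field order documented
in `man proc` (state is field 3, `utime` 14, `stime` 15, priority 18, nice 19), the helper names
are hypothetical, and the sample lines are made up for illustration rather than taken from a real
abort message.

```python
def parse_stat_prefix(stat_text):
    """Extract a few fields from a /proc/<pid>/task/<tid>/stat line or prefix."""
    # The command name (field 2) is parenthesized and may itself contain
    # spaces or parentheses, so split on the last ')'.
    head, sep, tail = stat_text.rpartition(")")
    if not sep:
        raise ValueError("no parenthesized command name found")
    tid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state), so field n is rest[n - 3]
    if len(rest) < 17:
        raise ValueError("prefix too short to include the nice field")
    return {
        "tid": int(tid_text),
        "comm": comm,
        "state": rest[0],
        "utime": int(rest[11]),     # field 14, in clock ticks
        "stime": int(rest[12]),     # field 15, in clock ticks
        "priority": int(rest[15]),  # field 18
        "nice": int(rest[16]),      # field 19
    }


def cpu_ticks_between(first, second):
    """Clock ticks of CPU time the thread received between two snapshots."""
    return (second["utime"] - first["utime"]) + (second["stime"] - first["stime"])


# Made-up sample snapshots, for illustration only.
FIRST = "4321 (Binder:4300_2) R 1 4300 0 0 -1 4194368 150 0 0 0 10 4 0 0 30 10 25 0"
SECOND = "4321 (Binder:4300_2) R 1 4300 0 0 -1 4194368 150 0 0 0 11 4 0 0 30 10 25 0"
```

With these samples, `cpu_ticks_between(parse_stat_prefix(FIRST), parse_stat_prefix(SECOND))`
is 1. A thread that stays in state `R` while its `utime` plus `stime` barely moves between the two
snapshots was runnable but rarely scheduled. Ticks are in units of `sysconf(_SC_CLK_TCK)`.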

Thread suspension has been known to time out for several reasons:

1) A deadlock involving thread suspension. The issues here are discussed in `mutator_gc_coord.md`.
A common cause of these appears to be "native" C++ locks that are both held while executing Java
code, and acquired in `@CriticalNative` or `@FastNative` JNI calls. These are clear bugs that,
once identified, usually have a fairly clear-cut fix.

2) Overcommitting the cores, so that the thread being "suspended" just does not get a chance to
run within the timeout of 4 or more seconds.

3) Either ART or `@CriticalNative`/`@FastNative` code that continues in Java `kRunnable` state for
too long without checking suspension requests.

4) The thread being suspended is either itself running at a low thread priority, or is waiting for
a thread at low thread priority. A Java priority 10 thread has Linux niceness -8, but a priority 1
thread has niceness 20. This means the former gets roughly 1.25^28, or more than 500, times the
cpu share of the latter when the device's cores are overcommitted.
It is worth noting that priority 5 (NORMAL) corresponds to niceness 0, while priority 4
corresponds to niceness 10, which is already almost a factor of 10 difference.

When we do see such timeouts, they are often a combination of the last three. The fixes in such a
case tend to be less clear. Cores may become significantly overcommitted due to attempts to avoid
unused cores, particularly during startup. There are currently times when ART needs to perform IO
or paging operations while the Java heap is not in a consistent state. Priority issues can be
difficult to address, since temporary priority changes may race with other priority changes.

Different suspension timeout failures will usually need to be addressed individually.
There is no single "silver bullet" fix for all of them. There is ongoing work
to improve the tools available for handling priority issues. Currently the possible fixes
include:

- Remove any newly discovered deadlocks, e.g. by removing an `@FastNative` annotation to prevent
  a lock from being acquired while the thread already has Java heap access. Alternatively, stop
  holding native locks across calls to Java.
- Reduce the amount of time spent continuously in Java runnable state.
  For application code, that may again involve removing `@FastNative` or `@CriticalNative`
  annotations. For ART internal code, break up `ScopedObjectAccess` sections or the like, being
  careful to not hold native pointers to Java heap objects across such sections.
- Avoid excessive parallelism that is causing some threads to starve.
- Reduce differences in thread priorities and, if necessary, avoid very low priority threads, for
  the same reason.
- On slow devices, if you are in a position to do so, consider setting `ro.hw_timeout_multiplier`
  to a value greater than one.
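
The cpu share figures quoted in the priority discussion above can be checked with a few lines of
arithmetic. This sketch assumes, as that discussion does, a factor of roughly 1.25 per niceness
step, and it uses only the priority-to-niceness pairs stated there. The names are hypothetical.

```python
# Assumption taken from the text: each niceness step changes the cpu share by
# a factor of roughly 1.25 when the cores are overcommitted.
SHARE_FACTOR_PER_NICE_STEP = 1.25

# Only the Java priority to niceness pairs that the text states.
NICENESS_FOR_JAVA_PRIORITY = {10: -8, 5: 0, 4: 10, 1: 20}


def cpu_share_ratio(favored_priority, other_priority):
    """Approximate cpu share of the favored thread relative to the other."""
    nice_gap = (NICENESS_FOR_JAVA_PRIORITY[other_priority]
                - NICENESS_FOR_JAVA_PRIORITY[favored_priority])
    return SHARE_FACTOR_PER_NICE_STEP ** nice_gap


print(round(cpu_share_ratio(10, 1)))    # about 517, i.e. "more than 500"
print(round(cpu_share_ratio(5, 4), 1))  # about 9.3, i.e. "almost a factor of 10"
```

The steep growth is why even a single step from priority 5 to priority 4 can matter on an
overcommitted device.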