1*03ce13f7SAndroid Build Coastguard WorkerRegister allocation in Subzero 2*03ce13f7SAndroid Build Coastguard Worker============================== 3*03ce13f7SAndroid Build Coastguard Worker 4*03ce13f7SAndroid Build Coastguard WorkerIntroduction 5*03ce13f7SAndroid Build Coastguard Worker------------ 6*03ce13f7SAndroid Build Coastguard Worker 7*03ce13f7SAndroid Build Coastguard Worker`Subzero 8*03ce13f7SAndroid Build Coastguard Worker<https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_ 9*03ce13f7SAndroid Build Coastguard Workeris a fast code generator that translates architecture-independent `PNaCl bitcode 10*03ce13f7SAndroid Build Coastguard Worker<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into 11*03ce13f7SAndroid Build Coastguard Workerarchitecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been 12*03ce13f7SAndroid Build Coastguard Workersimplified (e.g. weird-width primitive types like 57-bit integers are not 13*03ce13f7SAndroid Build Coastguard Workerallowed) and has had architecture-independent optimizations already applied. 14*03ce13f7SAndroid Build Coastguard WorkerSubzero aims to generate high-quality code as fast as practical, and as such 15*03ce13f7SAndroid Build Coastguard WorkerSubzero needs to make tradeoffs between compilation speed and output code 16*03ce13f7SAndroid Build Coastguard Workerquality. 17*03ce13f7SAndroid Build Coastguard Worker 18*03ce13f7SAndroid Build Coastguard WorkerIn Subzero, we have found register allocation to be one of the most important 19*03ce13f7SAndroid Build Coastguard Workeroptimizations toward achieving the best code quality, which is in tension with 20*03ce13f7SAndroid Build Coastguard Workerregister allocation's reputation for being complex and expensive. Linear-scan 21*03ce13f7SAndroid Build Coastguard Workerregister allocation is a modern favorite for getting fairly good register 22*03ce13f7SAndroid Build Coastguard Workerallocation at relatively low cost. Subzero uses linear-scan for its core 23*03ce13f7SAndroid Build Coastguard Workerregister allocator, with a few internal modifications to improve its 24*03ce13f7SAndroid Build Coastguard Workereffectiveness (specifically: register preference, register overlap, and register 25*03ce13f7SAndroid Build Coastguard Workeraliases). Subzero also does a few high-level things on top of its core register 26*03ce13f7SAndroid Build Coastguard Workerallocator to improve overall effectiveness (specifically: repeat until 27*03ce13f7SAndroid Build Coastguard Workerconvergence, delayed phi lowering, and local live range splitting). 28*03ce13f7SAndroid Build Coastguard Worker 29*03ce13f7SAndroid Build Coastguard WorkerWhat we describe here are techniques that have worked well for Subzero, in the 30*03ce13f7SAndroid Build Coastguard Workercontext of its particular intermediate representation (IR) and compilation 31*03ce13f7SAndroid Build Coastguard Workerstrategy. Some of these techniques may not be applicable to another compiler, 32*03ce13f7SAndroid Build Coastguard Workerdepending on its particular IR and compilation strategy. Some key concepts in 33*03ce13f7SAndroid Build Coastguard WorkerSubzero are the following: 34*03ce13f7SAndroid Build Coastguard Worker 35*03ce13f7SAndroid Build Coastguard Worker- Subzero's ``Variable`` operand is an operand that resides either on the stack 36*03ce13f7SAndroid Build Coastguard Worker or in a physical register. A Variable can be tagged as *must-have-register* 37*03ce13f7SAndroid Build Coastguard Worker or *must-not-have-register*, but its default is *may-have-register*. All uses 38*03ce13f7SAndroid Build Coastguard Worker of the Variable map to the same physical register or stack location. 39*03ce13f7SAndroid Build Coastguard Worker 40*03ce13f7SAndroid Build Coastguard Worker- Basic lowering is done before register allocation. Lowering is the process of 41*03ce13f7SAndroid Build Coastguard Worker translating PNaCl bitcode instructions into native instructions. Sometimes a 42*03ce13f7SAndroid Build Coastguard Worker native instruction, like the x86 ``add`` instruction, allows one of its 43*03ce13f7SAndroid Build Coastguard Worker Variable operands to be either in a physical register or on the stack, in 44*03ce13f7SAndroid Build Coastguard Worker which case the lowering is relatively simple. But if the lowered instruction 45*03ce13f7SAndroid Build Coastguard Worker requires the operand to be in a physical register, we generate code that 46*03ce13f7SAndroid Build Coastguard Worker copies the Variable into a *must-have-register* temporary, and then use that 47*03ce13f7SAndroid Build Coastguard Worker temporary in the lowered instruction. 48*03ce13f7SAndroid Build Coastguard Worker 49*03ce13f7SAndroid Build Coastguard Worker- Instructions within a basic block appear in a linearized order (as opposed to 50*03ce13f7SAndroid Build Coastguard Worker something like a directed acyclic graph of dependency edges). 51*03ce13f7SAndroid Build Coastguard Worker 52*03ce13f7SAndroid Build Coastguard Worker- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually 53*03ce13f7SAndroid Build Coastguard Worker small) number of *source* Variables. Assuming SSA form, the instruction 54*03ce13f7SAndroid Build Coastguard Worker begins the live range of the dest Variable, and may end the live range of one 55*03ce13f7SAndroid Build Coastguard Worker or more of the source Variables. 56*03ce13f7SAndroid Build Coastguard Worker 57*03ce13f7SAndroid Build Coastguard WorkerExecutive summary 58*03ce13f7SAndroid Build Coastguard Worker----------------- 59*03ce13f7SAndroid Build Coastguard Worker 60*03ce13f7SAndroid Build Coastguard Worker- Liveness analysis and live range construction are prerequisites for register 61*03ce13f7SAndroid Build Coastguard Worker allocation. Without careful attention, they can be potentially very 62*03ce13f7SAndroid Build Coastguard Worker expensive, especially when the number of variables and basic blocks gets very 63*03ce13f7SAndroid Build Coastguard Worker large. Subzero uses some clever data structures to take advantage of the 64*03ce13f7SAndroid Build Coastguard Worker sparsity of the data, resulting in stable performance as function size scales 65*03ce13f7SAndroid Build Coastguard Worker up. This means that the proportion of compilation time spent on liveness 66*03ce13f7SAndroid Build Coastguard Worker analysis stays roughly the same. 67*03ce13f7SAndroid Build Coastguard Worker 68*03ce13f7SAndroid Build Coastguard Worker- The core of Subzero's register allocator is taken from `Mössenböck and 69*03ce13f7SAndroid Build Coastguard Worker Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on 70*03ce13f7SAndroid Build Coastguard Worker linear-scan register allocation. 71*03ce13f7SAndroid Build Coastguard Worker 72*03ce13f7SAndroid Build Coastguard Worker- We enhance the core algorithm with a good automatic preference mechanism when 73*03ce13f7SAndroid Build Coastguard Worker more than one register is available, to try to minimize register shuffling. 74*03ce13f7SAndroid Build Coastguard Worker 75*03ce13f7SAndroid Build Coastguard Worker- We also enhance it to allow overlapping live ranges to share the same 76*03ce13f7SAndroid Build Coastguard Worker register, when one variable is recognized as a read-only copy of the other. 77*03ce13f7SAndroid Build Coastguard Worker 78*03ce13f7SAndroid Build Coastguard Worker- We deal with register aliasing in a clean and general fashion. Register 79*03ce13f7SAndroid Build Coastguard Worker aliasing is when e.g. 16-bit architectural registers share some bits with 80*03ce13f7SAndroid Build Coastguard Worker their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit 81*03ce13f7SAndroid Build Coastguard Worker registers. 82*03ce13f7SAndroid Build Coastguard Worker 83*03ce13f7SAndroid Build Coastguard Worker- We improve register allocation by rerunning the algorithm on likely candidates 84*03ce13f7SAndroid Build Coastguard Worker that didn't get registers during the previous iteration, without imposing much 85*03ce13f7SAndroid Build Coastguard Worker additional cost. 86*03ce13f7SAndroid Build Coastguard Worker 87*03ce13f7SAndroid Build Coastguard Worker- The main register allocation is done before phi lowering, because phi lowering 88*03ce13f7SAndroid Build Coastguard Worker imposes early and unnecessary ordering constraints on the resulting 89*03ce13f7SAndroid Build Coastguard Worker assigments, which create spurious interferences in live ranges. 90*03ce13f7SAndroid Build Coastguard Worker 91*03ce13f7SAndroid Build Coastguard Worker- Within each basic block, we aggressively split each variable's live range at 92*03ce13f7SAndroid Build Coastguard Worker every use, so that portions of the live range can get registers even if the 93*03ce13f7SAndroid Build Coastguard Worker whole live range can't. Doing this separately for each basic block avoids 94*03ce13f7SAndroid Build Coastguard Worker merge complications, and keeps liveness analysis and register allocation fast 95*03ce13f7SAndroid Build Coastguard Worker by fitting well into the sparsity optimizations of their data structures. 96*03ce13f7SAndroid Build Coastguard Worker 97*03ce13f7SAndroid Build Coastguard WorkerLiveness analysis 98*03ce13f7SAndroid Build Coastguard Worker----------------- 99*03ce13f7SAndroid Build Coastguard Worker 100*03ce13f7SAndroid Build Coastguard WorkerWith respect to register allocation, the main purpose of liveness analysis is to 101*03ce13f7SAndroid Build Coastguard Workercalculate the live range of each variable. The live range is represented as a 102*03ce13f7SAndroid Build Coastguard Workerset of instruction number ranges. Instruction numbers within a basic block must 103*03ce13f7SAndroid Build Coastguard Workerbe monotonically increasing, and the instruction ranges of two different basic 104*03ce13f7SAndroid Build Coastguard Workerblocks must not overlap. 105*03ce13f7SAndroid Build Coastguard Worker 106*03ce13f7SAndroid Build Coastguard WorkerBasic liveness 107*03ce13f7SAndroid Build Coastguard Worker^^^^^^^^^^^^^^ 108*03ce13f7SAndroid Build Coastguard Worker 109*03ce13f7SAndroid Build Coastguard WorkerLiveness analysis is a straightforward dataflow algorithm. For each basic 110*03ce13f7SAndroid Build Coastguard Workerblock, we keep track of the live-in and live-out set, i.e. the set of variables 111*03ce13f7SAndroid Build Coastguard Workerthat are live coming into or going out of the basic block. Processing of a 112*03ce13f7SAndroid Build Coastguard Workerbasic block starts by initializing a temporary set as the union of the live-in 113*03ce13f7SAndroid Build Coastguard Workersets of the basic block's successor blocks. (This basic block's live-out set is 114*03ce13f7SAndroid Build Coastguard Workercaptured as the initial value of the temporary set.) Then each instruction of 115*03ce13f7SAndroid Build Coastguard Workerthe basic block is processed in reverse order. All the source variables of the 116*03ce13f7SAndroid Build Coastguard Workerinstruction are marked as live, by adding them to the temporary set, and the 117*03ce13f7SAndroid Build Coastguard Workerdest variable of the instruction (if any) is marked as not-live, by removing it 118*03ce13f7SAndroid Build Coastguard Workerfrom the temporary set. When we finish processing all of the block's 119*03ce13f7SAndroid Build Coastguard Workerinstructions, we add/union the temporary set into the basic block's live-in set. 120*03ce13f7SAndroid Build Coastguard WorkerIf this changes the basic block's live-in set, then we mark all of this basic 121*03ce13f7SAndroid Build Coastguard Workerblock's predecessor blocks to be reprocessed. Then we repeat for other basic 122*03ce13f7SAndroid Build Coastguard Workerblocks until convergence, i.e. no more basic blocks are marked to be 123*03ce13f7SAndroid Build Coastguard Workerreprocessed. If basic blocks are processed in reverse topological order, then 124*03ce13f7SAndroid Build Coastguard Workerthe number of times each basic block need to be reprocessed is generally its 125*03ce13f7SAndroid Build Coastguard Workerloop nest depth. 126*03ce13f7SAndroid Build Coastguard Worker 127*03ce13f7SAndroid Build Coastguard WorkerThe result of this pass is the live-in and live-out set for each basic block. 128*03ce13f7SAndroid Build Coastguard Worker 129*03ce13f7SAndroid Build Coastguard WorkerWith so many set operations, choice of data structure is crucial for 130*03ce13f7SAndroid Build Coastguard Workerperformance. We tried a few options, and found that a simple dense bit vector 131*03ce13f7SAndroid Build Coastguard Workerworks best. This keeps the per-instruction cost very low. However, we found 132*03ce13f7SAndroid Build Coastguard Workerthat when the function gets very large, merging and copying bit vectors at basic 133*03ce13f7SAndroid Build Coastguard Workerblock boundaries dominates the cost. This is due to the number of variables 134*03ce13f7SAndroid Build Coastguard Worker(hence the bit vector size) and the number of basic blocks getting large. 135*03ce13f7SAndroid Build Coastguard Worker 136*03ce13f7SAndroid Build Coastguard WorkerA simple enhancement brought this under control in Subzero. It turns out that 137*03ce13f7SAndroid Build Coastguard Workerthe vast majority of variables are referenced, and therefore live, only in a 138*03ce13f7SAndroid Build Coastguard Workersingle basic block. This is largely due to the SSA form of PNaCl bitcode. To 139*03ce13f7SAndroid Build Coastguard Workertake advantage of this, we partition variables by single-block versus 140*03ce13f7SAndroid Build Coastguard Workermulti-block liveness. Multi-block variables get lower-numbered bit vector 141*03ce13f7SAndroid Build Coastguard Workerindexes, and single-block variables get higher-number indexes. Single-block bit 142*03ce13f7SAndroid Build Coastguard Workervector indexes are reused across different basic blocks. As such, the size of 143*03ce13f7SAndroid Build Coastguard Workerlive-in and live-out bit vectors is limited to the number of multi-block 144*03ce13f7SAndroid Build Coastguard Workervariables, and the temporary set's size can be limited to that plus the largest 145*03ce13f7SAndroid Build Coastguard Workernumber of single-block variables across all basic blocks. 146*03ce13f7SAndroid Build Coastguard Worker 147*03ce13f7SAndroid Build Coastguard WorkerFor the purpose of live range construction, we also need to track definitions 148*03ce13f7SAndroid Build Coastguard Worker(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the 149*03ce13f7SAndroid Build Coastguard Workerbasic block. These are easy to detect while processing the instructions; data 150*03ce13f7SAndroid Build Coastguard Workerstructure choices are described below. 151*03ce13f7SAndroid Build Coastguard Worker 152*03ce13f7SAndroid Build Coastguard WorkerLive range construction 153*03ce13f7SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^ 154*03ce13f7SAndroid Build Coastguard Worker 155*03ce13f7SAndroid Build Coastguard WorkerAfter the live-in and live-out sets are calculated, we construct each variable's 156*03ce13f7SAndroid Build Coastguard Workerlive range (as an ordered list of instruction ranges, described above). We do 157*03ce13f7SAndroid Build Coastguard Workerthis by considering the live-in and live-out sets, combined with LiveBegin and 158*03ce13f7SAndroid Build Coastguard WorkerLiveEnd information. This is done separately for each basic block. 159*03ce13f7SAndroid Build Coastguard Worker 160*03ce13f7SAndroid Build Coastguard WorkerAs before, we need to take advantage of sparsity of variable uses across basic 161*03ce13f7SAndroid Build Coastguard Workerblocks, to avoid overly copying/merging data structures. The following is what 162*03ce13f7SAndroid Build Coastguard Workerworked well for Subzero (after trying several other options). 163*03ce13f7SAndroid Build Coastguard Worker 164*03ce13f7SAndroid Build Coastguard WorkerThe basic liveness pass, described above, keeps track of when a variable's live 165*03ce13f7SAndroid Build Coastguard Workerrange begins or ends within the block. LiveBegin and LiveEnd are unordered 166*03ce13f7SAndroid Build Coastguard Workervectors where each element is a pair of the variable and the instruction number, 167*03ce13f7SAndroid Build Coastguard Workerrepresenting that the particular variable's live range begins or ends at the 168*03ce13f7SAndroid Build Coastguard Workerparticular instruction. When the liveness pass finds a variable whose live 169*03ce13f7SAndroid Build Coastguard Workerrange begins or ends, it appends and entry to LiveBegin or LiveEnd. 170*03ce13f7SAndroid Build Coastguard Worker 171*03ce13f7SAndroid Build Coastguard WorkerDuring live range construction, the LiveBegin and LiveEnd vectors are sorted by 172*03ce13f7SAndroid Build Coastguard Workervariable number. Then we iterate across both vectors in parallel. If a 173*03ce13f7SAndroid Build Coastguard Workervariable appears in both LiveBegin and LiveEnd, then its live range is entirely 174*03ce13f7SAndroid Build Coastguard Workerwithin this block. If it appears in only LiveBegin, then its live range starts 175*03ce13f7SAndroid Build Coastguard Workerhere and extends through the end of the block. If it appears in only LiveEnd, 176*03ce13f7SAndroid Build Coastguard Workerthen its live range starts at the beginning of the block and ends here. (Note 177*03ce13f7SAndroid Build Coastguard Workerthat this only covers the live range within this block, and this process is 178*03ce13f7SAndroid Build Coastguard Workerrepeated across all blocks.) 179*03ce13f7SAndroid Build Coastguard Worker 180*03ce13f7SAndroid Build Coastguard WorkerIt is also possible that a variable is live within this block but its live range 181*03ce13f7SAndroid Build Coastguard Workerdoes not begin or end in this block. These variables are identified simply by 182*03ce13f7SAndroid Build Coastguard Workertaking the intersection of the live-in and live-out sets. 183*03ce13f7SAndroid Build Coastguard Worker 184*03ce13f7SAndroid Build Coastguard WorkerAs a result of these data structures, performance of liveness analysis and live 185*03ce13f7SAndroid Build Coastguard Workerrange construction tend to be stable across small, medium, and large functions, 186*03ce13f7SAndroid Build Coastguard Workeras measured by a fairly consistent proportion of total compilation time spent on 187*03ce13f7SAndroid Build Coastguard Workerthe liveness passes. 188*03ce13f7SAndroid Build Coastguard Worker 189*03ce13f7SAndroid Build Coastguard WorkerLinear-scan register allocation 190*03ce13f7SAndroid Build Coastguard Worker------------------------------- 191*03ce13f7SAndroid Build Coastguard Worker 192*03ce13f7SAndroid Build Coastguard WorkerThe basis of Subzero's register allocator is the allocator described by 193*03ce13f7SAndroid Build Coastguard WorkerHanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in 194*03ce13f7SAndroid Build Coastguard Workerthe Context of SSA Form and Register Constraints 195*03ce13f7SAndroid Build Coastguard Worker<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to 196*03ce13f7SAndroid Build Coastguard Workerbe a union of intervals rather than a single conservative interval, and it 197*03ce13f7SAndroid Build Coastguard Workerallows pre-coloring of variables with specific physical registers. 198*03ce13f7SAndroid Build Coastguard Worker 199*03ce13f7SAndroid Build Coastguard WorkerThe paper suggests an approach of aggressively coalescing variables across Phi 200*03ce13f7SAndroid Build Coastguard Workerinstructions (i.e., trying to force Phi source and dest variables to have the 201*03ce13f7SAndroid Build Coastguard Workersame register assignment), but we omit that in favor of the more natural 202*03ce13f7SAndroid Build Coastguard Workerpreference mechanism described below. 203*03ce13f7SAndroid Build Coastguard Worker 204*03ce13f7SAndroid Build Coastguard WorkerWe found the paper quite remarkable in that a straightforward implementation of 205*03ce13f7SAndroid Build Coastguard Workerits pseudo-code led to an entirely correct register allocator. The only thing 206*03ce13f7SAndroid Build Coastguard Workerwe found in the specification that was even close to a mistake is that it was 207*03ce13f7SAndroid Build Coastguard Workertoo aggressive in evicting inactive ranges in the "else" clause of the 208*03ce13f7SAndroid Build Coastguard WorkerAssignMemLoc routine. An inactive range only needs to be evicted if it actually 209*03ce13f7SAndroid Build Coastguard Workeroverlaps the current range being considered, whereas the paper evicts it 210*03ce13f7SAndroid Build Coastguard Workerunconditionally. (Search for "original paper" in Subzero's register allocator 211*03ce13f7SAndroid Build Coastguard Workersource code.) 212*03ce13f7SAndroid Build Coastguard Worker 213*03ce13f7SAndroid Build Coastguard WorkerRegister preference 214*03ce13f7SAndroid Build Coastguard Worker------------------- 215*03ce13f7SAndroid Build Coastguard Worker 216*03ce13f7SAndroid Build Coastguard WorkerThe linear-scan algorithm from the paper talks about choosing an available 217*03ce13f7SAndroid Build Coastguard Workerregister, but isn't specific on how to choose among several available registers. 218*03ce13f7SAndroid Build Coastguard WorkerThe simplest approach is to just choose the first available register, e.g. the 219*03ce13f7SAndroid Build Coastguard Workerlowest-numbered register. Often a better choice is possible. 220*03ce13f7SAndroid Build Coastguard Worker 221*03ce13f7SAndroid Build Coastguard WorkerSpecifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` 222*03ce13f7SAndroid Build Coastguard Workerwith source variables ``S1,S2,...``, and that instruction ends the live range of 223*03ce13f7SAndroid Build Coastguard Workerone of those source variables ``Sn``, and ``Sn`` was assigned a register, then 224*03ce13f7SAndroid Build Coastguard Worker``Sn``'s register is usually a good choice for ``V``. This is especially true 225*03ce13f7SAndroid Build Coastguard Workerwhen the instruction is a simple assignment, because an assignment where the 226*03ce13f7SAndroid Build Coastguard Workerdest and source variables end up with the same register can be trivially elided, 227*03ce13f7SAndroid Build Coastguard Workerreducing the amount of register-shuffling code. 228*03ce13f7SAndroid Build Coastguard Worker 229*03ce13f7SAndroid Build Coastguard WorkerThis requires being able to find and inspect a variable's defining instruction, 230*03ce13f7SAndroid Build Coastguard Workerwhich is not an assumption in the original paper but is easily tracked in 231*03ce13f7SAndroid Build Coastguard Workerpractice. 232*03ce13f7SAndroid Build Coastguard Worker 233*03ce13f7SAndroid Build Coastguard WorkerRegister overlap 234*03ce13f7SAndroid Build Coastguard Worker---------------- 235*03ce13f7SAndroid Build Coastguard Worker 236*03ce13f7SAndroid Build Coastguard WorkerBecause Subzero does register allocation after basic lowering, the lowering has 237*03ce13f7SAndroid Build Coastguard Workerto be prepared for the possibility of any given program variable not getting a 238*03ce13f7SAndroid Build Coastguard Workerphysical register. It does this by introducing *must-have-register* temporary 239*03ce13f7SAndroid Build Coastguard Workervariables, and copies the program variable into the temporary to ensure that 240*03ce13f7SAndroid Build Coastguard Workerregister requirements in the target instruction are met. 241*03ce13f7SAndroid Build Coastguard Worker 242*03ce13f7SAndroid Build Coastguard WorkerIn many cases, the program variable and temporary variable have overlapping live 243*03ce13f7SAndroid Build Coastguard Workerranges, and would be forced to have different registers even if the temporary 244*03ce13f7SAndroid Build Coastguard Workervariable is effectively a read-only copy of the program variable. We recognize 245*03ce13f7SAndroid Build Coastguard Workerthis when the program variable has no definitions within the temporary 246*03ce13f7SAndroid Build Coastguard Workervariable's live range, and the temporary variable has no definitions within the 247*03ce13f7SAndroid Build Coastguard Workerprogram variable's live range with the exception of the copy assignment. 248*03ce13f7SAndroid Build Coastguard Worker 249*03ce13f7SAndroid Build Coastguard WorkerThis analysis is done as part of register preference detection. 250*03ce13f7SAndroid Build Coastguard Worker 251*03ce13f7SAndroid Build Coastguard WorkerThe main impact on the linear-scan implementation is that instead of 252*03ce13f7SAndroid Build Coastguard Workersetting/resetting a boolean flag to indicate whether a register is free or in 253*03ce13f7SAndroid Build Coastguard Workeruse, we increment/decrement a number-of-uses counter. 254*03ce13f7SAndroid Build Coastguard Worker 255*03ce13f7SAndroid Build Coastguard WorkerRegister aliases 256*03ce13f7SAndroid Build Coastguard Worker---------------- 257*03ce13f7SAndroid Build Coastguard Worker 258*03ce13f7SAndroid Build Coastguard WorkerSometimes registers of different register classes partially overlap. For 259*03ce13f7SAndroid Build Coastguard Workerexample, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't 260*03ce13f7SAndroid Build Coastguard Workeralias each other), and all three alias ``eax`` and ``rax``. And in ARM, 261*03ce13f7SAndroid Build Coastguard Workerregisters ``s0`` and ``s1`` (which are single-precision floating-point 262*03ce13f7SAndroid Build Coastguard Workerregisters) alias ``d0`` (double-precision floating-point), and registers ``d0`` 263*03ce13f7SAndroid Build Coastguard Workerand ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper 264*03ce13f7SAndroid Build Coastguard Workerdoesn't address this issue; it assumes that all registers are independent. A 265*03ce13f7SAndroid Build Coastguard Workersimple solution is to essentially avoid aliasing by disallowing a subset of the 266*03ce13f7SAndroid Build Coastguard Workerregisters, but there is obviously a reduction in code quality when e.g. half of 267*03ce13f7SAndroid Build Coastguard Workerthe registers are taken away. 268*03ce13f7SAndroid Build Coastguard Worker 269*03ce13f7SAndroid Build Coastguard WorkerSubzero handles this more elegantly. For each register, we keep a bitmask 270*03ce13f7SAndroid Build Coastguard Workerindicating which registers alias/conflict with it. For example, in x86, 271*03ce13f7SAndroid Build Coastguard Worker``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and 272*03ce13f7SAndroid Build Coastguard Workerin ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to 273*03ce13f7SAndroid Build Coastguard Workermark a register as being used or released, we do the same for all of its 274*03ce13f7SAndroid Build Coastguard Workeraliases. 275*03ce13f7SAndroid Build Coastguard Worker 276*03ce13f7SAndroid Build Coastguard WorkerBefore implementing this, we were a bit apprehensive about the compile-time 277*03ce13f7SAndroid Build Coastguard Workerperformance impact. Instead of setting one bit in a bit vector or decrementing 278*03ce13f7SAndroid Build Coastguard Workerone counter, this generally needs to be done in a loop that iterates over all 279*03ce13f7SAndroid Build Coastguard Workeraliases. Fortunately, this seemed to make very little difference in 280*03ce13f7SAndroid Build Coastguard Workerperformance, as the bulk of the cost ends up being in live range overlap 281*03ce13f7SAndroid Build Coastguard Workercomputations, which are not affected by register aliasing. 282*03ce13f7SAndroid Build Coastguard Worker 283*03ce13f7SAndroid Build Coastguard WorkerRepeat until convergence 284*03ce13f7SAndroid Build Coastguard Worker------------------------ 285*03ce13f7SAndroid Build Coastguard Worker 286*03ce13f7SAndroid Build Coastguard WorkerSometimes the linear-scan algorithm makes a register assignment only to later 287*03ce13f7SAndroid Build Coastguard Workerrevoke it in favor of a higher-priority variable, but it turns out that a 288*03ce13f7SAndroid Build Coastguard Workerdifferent initial register choice would not have been revoked. For relatively 289*03ce13f7SAndroid Build Coastguard Workerlow compile-time cost, we can give those variables another chance. 290*03ce13f7SAndroid Build Coastguard Worker 291*03ce13f7SAndroid Build Coastguard WorkerDuring register allocation, we keep track of the revoked variables and then do 292*03ce13f7SAndroid Build Coastguard Workeranother round of register allocation targeted only to that set. We repeat until 293*03ce13f7SAndroid Build Coastguard Workerno new register assignments are made, which is usually just a handful of 294*03ce13f7SAndroid Build Coastguard Workersuccessively cheaper iterations. 295*03ce13f7SAndroid Build Coastguard Worker 296*03ce13f7SAndroid Build Coastguard WorkerAnother approach would be to repeat register allocation for *all* variables that 297*03ce13f7SAndroid Build Coastguard Workerhaven't had a register assigned (rather than variables that got a register that 298*03ce13f7SAndroid Build Coastguard Workerwas subsequently revoked), but our experience is that this greatly increases 299*03ce13f7SAndroid Build Coastguard Workercompile-time cost, with little or no code quality gain. 300*03ce13f7SAndroid Build Coastguard Worker 301*03ce13f7SAndroid Build Coastguard WorkerDelayed Phi lowering 302*03ce13f7SAndroid Build Coastguard Worker-------------------- 303*03ce13f7SAndroid Build Coastguard Worker 304*03ce13f7SAndroid Build Coastguard WorkerThe linear-scan algorithm works for phi instructions as well as regular 305*03ce13f7SAndroid Build Coastguard Workerinstructions, but it is tempting to lower phi instructions into non-SSA 306*03ce13f7SAndroid Build Coastguard Workerassignments before register allocation, so that register allocation need only 307*03ce13f7SAndroid Build Coastguard Workerhappen once. 308*03ce13f7SAndroid Build Coastguard Worker 309*03ce13f7SAndroid Build Coastguard WorkerUnfortunately, simple phi lowering imposes an arbitrary ordering on the 310*03ce13f7SAndroid Build Coastguard Workerresulting assignments that can cause artificial overlap/interference between 311*03ce13f7SAndroid Build Coastguard Workerlowered assignments, and can lead to worse register allocation decisions. As a 312*03ce13f7SAndroid Build Coastguard Workersimple example, consider these two phi instructions which are semantically 313*03ce13f7SAndroid Build Coastguard Workerunordered:: 314*03ce13f7SAndroid Build Coastguard Worker 315*03ce13f7SAndroid Build Coastguard Worker A = phi(B) // B's live range ends here 316*03ce13f7SAndroid Build Coastguard Worker C = phi(D) // D's live range ends here 317*03ce13f7SAndroid Build Coastguard Worker 318*03ce13f7SAndroid Build Coastguard WorkerA straightforward lowering might yield:: 319*03ce13f7SAndroid Build Coastguard Worker 320*03ce13f7SAndroid Build Coastguard Worker A = B // end of B's live range 321*03ce13f7SAndroid Build Coastguard Worker C = D // end of D's live range 322*03ce13f7SAndroid Build Coastguard Worker 323*03ce13f7SAndroid Build Coastguard WorkerThe potential problem here is that A and D's live ranges overlap, and so they 324*03ce13f7SAndroid Build Coastguard Workerare prevented from having the same register. Swapping the order of lowered 325*03ce13f7SAndroid Build Coastguard Workerassignments fixes that (but then B and C would overlap), but we can't really 326*03ce13f7SAndroid Build Coastguard Workerknow which is better until after register allocation. 327*03ce13f7SAndroid Build Coastguard Worker 328*03ce13f7SAndroid Build Coastguard WorkerSubzero deals with this by doing the main register allocation before phi 329*03ce13f7SAndroid Build Coastguard Workerlowering, followed by phi lowering, and finally a special register allocation 330*03ce13f7SAndroid Build Coastguard Workerpass limited to the new lowered assignments. 331*03ce13f7SAndroid Build Coastguard Worker 332*03ce13f7SAndroid Build Coastguard WorkerPhi lowering considers the phi operands separately for each predecessor edge, 333*03ce13f7SAndroid Build Coastguard Workerand starts by finding a topological sort of the Phi instructions, such that 334*03ce13f7SAndroid Build Coastguard Workerassignments can be executed in that order without violating dependencies on 335*03ce13f7SAndroid Build Coastguard Workerregisters or stack locations. If a topological sort is not possible due to a 336*03ce13f7SAndroid Build Coastguard Workercycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can 337*03ce13f7SAndroid Build Coastguard Workerbecome ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up 338*03ce13f7SAndroid Build Coastguard Workerregisters early to reduce register pressure. 339*03ce13f7SAndroid Build Coastguard Worker 340*03ce13f7SAndroid Build Coastguard WorkerIt then lowers the linearized assignments into machine instructions (which may 341*03ce13f7SAndroid Build Coastguard Workerrequire extra physical registers e.g. to copy from one stack location to 342*03ce13f7SAndroid Build Coastguard Workeranother), and finally runs the register allocator limited to those instructions. 343*03ce13f7SAndroid Build Coastguard Worker 344*03ce13f7SAndroid Build Coastguard WorkerIn rare cases, the register allocator may fail on some *must-have-register* 345*03ce13f7SAndroid Build Coastguard Workervariable, because register pressure is too high to satisfy requirements arising 346*03ce13f7SAndroid Build Coastguard Workerfrom cycle-breaking temporaries and registers required for stack-to-stack 347*03ce13f7SAndroid Build Coastguard Workercopies. In this case, we have to find a register with no active uses within the 348*03ce13f7SAndroid Build Coastguard Workervariable's live range, and actively spill/restore that register around the live 349*03ce13f7SAndroid Build Coastguard Workerrange. This makes the code quality suffer and may be slow to implement 350*03ce13f7SAndroid Build Coastguard Workerdepending on compiler data structures, but in practice we find the situation to 351*03ce13f7SAndroid Build Coastguard Workerbe vanishingly rare and so not really worth optimizing. 352*03ce13f7SAndroid Build Coastguard Worker 353*03ce13f7SAndroid Build Coastguard WorkerLocal live range splitting 354*03ce13f7SAndroid Build Coastguard Worker-------------------------- 355*03ce13f7SAndroid Build Coastguard Worker 356*03ce13f7SAndroid Build Coastguard WorkerThe basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets 357*03ce13f7SAndroid Build Coastguard Workera register for its entire live range, or not at all. This is unfortunate when a 358*03ce13f7SAndroid Build Coastguard Workervariable has many uses close together, but ultimately a large enough live range 359*03ce13f7SAndroid Build Coastguard Workerto prevent register assignment. Another bad example is on x86 where all vector 360*03ce13f7SAndroid Build Coastguard Workerand floating-point registers (the ``xmm`` registers) are killed by call 361*03ce13f7SAndroid Build Coastguard Workerinstructions, per the x86 call ABI, so such variables are completely prevented 362*03ce13f7SAndroid Build Coastguard Workerfrom having a register when their live ranges contain a call instruction. 363*03ce13f7SAndroid Build Coastguard Worker 364*03ce13f7SAndroid Build Coastguard WorkerThe general solution here is some policy for splitting live ranges. A variable 365*03ce13f7SAndroid Build Coastguard Workercan be split into multiple copies and each can be register-allocated separately. 366*03ce13f7SAndroid Build Coastguard WorkerThe complication comes in finding a sane policy for where and when to split 367*03ce13f7SAndroid Build Coastguard Workervariables such that complexity doesn't explode, and how to join the different 368*03ce13f7SAndroid Build Coastguard Workervalues at merge points. 369*03ce13f7SAndroid Build Coastguard Worker 370*03ce13f7SAndroid Build Coastguard WorkerSubzero implements aggressive block-local splitting of variables. Each basic 371*03ce13f7SAndroid Build Coastguard Workerblock is handled separately and independently. Within the block, we maintain a 372*03ce13f7SAndroid Build Coastguard Workertable ``T`` that maps each variable ``V`` to its split version ``T[V]``, with 373*03ce13f7SAndroid Build Coastguard Workerevery variable ``V``'s initial state set (implicitly) as ``T[V]=V``. For each 374*03ce13f7SAndroid Build Coastguard Workerinstruction in the block, and for each *may-have-register* variable ``V`` in the 375*03ce13f7SAndroid Build Coastguard Workerinstruction, we do the following: 376*03ce13f7SAndroid Build Coastguard Worker 377*03ce13f7SAndroid Build Coastguard Worker- Replace all uses of ``V`` in the instruction with ``T[V]``. 378*03ce13f7SAndroid Build Coastguard Worker 379*03ce13f7SAndroid Build Coastguard Worker- Create a new split variable ``V'``. 380*03ce13f7SAndroid Build Coastguard Worker 381*03ce13f7SAndroid Build Coastguard Worker- Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately 382*03ce13f7SAndroid Build Coastguard Worker before or immediately after) the current instruction. 383*03ce13f7SAndroid Build Coastguard Worker 384*03ce13f7SAndroid Build Coastguard Worker- Update ``T[V]`` to be ``V'``. 385*03ce13f7SAndroid Build Coastguard Worker 386*03ce13f7SAndroid Build Coastguard WorkerThis leads to a chain of copies of ``V`` through the block, linked by assignment 387*03ce13f7SAndroid Build Coastguard Workerinstructions. The live ranges of these copies are usually much smaller, and 388*03ce13f7SAndroid Build Coastguard Workermore likely to get registers. In fact, because of the preference mechanism 389*03ce13f7SAndroid Build Coastguard Workerdescribed above, they are likely to get the same register whenever possible. 390*03ce13f7SAndroid Build Coastguard Worker 391*03ce13f7SAndroid Build Coastguard WorkerOne obvious question comes up: won't this proliferation of new variables cause 392*03ce13f7SAndroid Build Coastguard Workeran explosion in the running time of liveness analysis and register allocation? 393*03ce13f7SAndroid Build Coastguard WorkerAs it turns out, not really. 394*03ce13f7SAndroid Build Coastguard Worker 395*03ce13f7SAndroid Build Coastguard WorkerFirst, for register allocation, the cost tends to be dominated by live range 396*03ce13f7SAndroid Build Coastguard Workeroverlap computations, whose cost is roughly proportional to the size of the live 397*03ce13f7SAndroid Build Coastguard Workerrange. All the new variable copies' live ranges sum up to the original 398*03ce13f7SAndroid Build Coastguard Workervariable's live range, so the cost isn't vastly greater. 399*03ce13f7SAndroid Build Coastguard Worker 400*03ce13f7SAndroid Build Coastguard WorkerSecond, for liveness analysis, the cost is dominated by merging bit vectors 401*03ce13f7SAndroid Build Coastguard Workercorresponding to the set of variables that have multi-block liveness. All the 402*03ce13f7SAndroid Build Coastguard Workernew copies are guaranteed to be single-block, so the main additional cost is 403*03ce13f7SAndroid Build Coastguard Workerthat of processing the new assignments. 404*03ce13f7SAndroid Build Coastguard Worker 405*03ce13f7SAndroid Build Coastguard WorkerThere's one other key issue here. The original variable and all of its copies 406*03ce13f7SAndroid Build Coastguard Workerneed to be "linked", in the sense that all of these variables that get a stack 407*03ce13f7SAndroid Build Coastguard Workerslot (because they didn't get a register) are guaranteed to have the same stack 408*03ce13f7SAndroid Build Coastguard Workerslot. This way, we can avoid generating any code related to ``V'=V`` when 409*03ce13f7SAndroid Build Coastguard Workerneither gets a register. In addition, we can elide instructions that write a 410*03ce13f7SAndroid Build Coastguard Workervalue to a stack slot that is linked to another stack slot, because that is 411*03ce13f7SAndroid Build Coastguard Workerguaranteed to be just rewriting the same value to the stack. 412