xref: /aosp_15_r20/external/swiftshader/third_party/subzero/docs/REGALLOC.rst (revision 03ce13f70fcc45d86ee91b7ee4cab1936a95046e)
1*03ce13f7SAndroid Build Coastguard WorkerRegister allocation in Subzero
2*03ce13f7SAndroid Build Coastguard Worker==============================
3*03ce13f7SAndroid Build Coastguard Worker
4*03ce13f7SAndroid Build Coastguard WorkerIntroduction
5*03ce13f7SAndroid Build Coastguard Worker------------
6*03ce13f7SAndroid Build Coastguard Worker
7*03ce13f7SAndroid Build Coastguard Worker`Subzero
8*03ce13f7SAndroid Build Coastguard Worker<https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_
9*03ce13f7SAndroid Build Coastguard Workeris a fast code generator that translates architecture-independent `PNaCl bitcode
10*03ce13f7SAndroid Build Coastguard Worker<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into
11*03ce13f7SAndroid Build Coastguard Workerarchitecture-specific machine code.  PNaCl bitcode is LLVM bitcode that has been
12*03ce13f7SAndroid Build Coastguard Workersimplified (e.g. weird-width primitive types like 57-bit integers are not
13*03ce13f7SAndroid Build Coastguard Workerallowed) and has had architecture-independent optimizations already applied.
14*03ce13f7SAndroid Build Coastguard WorkerSubzero aims to generate high-quality code as fast as practical, and as such
15*03ce13f7SAndroid Build Coastguard WorkerSubzero needs to make tradeoffs between compilation speed and output code
16*03ce13f7SAndroid Build Coastguard Workerquality.
17*03ce13f7SAndroid Build Coastguard Worker
18*03ce13f7SAndroid Build Coastguard WorkerIn Subzero, we have found register allocation to be one of the most important
19*03ce13f7SAndroid Build Coastguard Workeroptimizations toward achieving the best code quality, which is in tension with
20*03ce13f7SAndroid Build Coastguard Workerregister allocation's reputation for being complex and expensive.  Linear-scan
21*03ce13f7SAndroid Build Coastguard Workerregister allocation is a modern favorite for getting fairly good register
22*03ce13f7SAndroid Build Coastguard Workerallocation at relatively low cost.  Subzero uses linear-scan for its core
23*03ce13f7SAndroid Build Coastguard Workerregister allocator, with a few internal modifications to improve its
24*03ce13f7SAndroid Build Coastguard Workereffectiveness (specifically: register preference, register overlap, and register
25*03ce13f7SAndroid Build Coastguard Workeraliases).  Subzero also does a few high-level things on top of its core register
26*03ce13f7SAndroid Build Coastguard Workerallocator to improve overall effectiveness (specifically: repeat until
27*03ce13f7SAndroid Build Coastguard Workerconvergence, delayed phi lowering, and local live range splitting).
28*03ce13f7SAndroid Build Coastguard Worker
29*03ce13f7SAndroid Build Coastguard WorkerWhat we describe here are techniques that have worked well for Subzero, in the
30*03ce13f7SAndroid Build Coastguard Workercontext of its particular intermediate representation (IR) and compilation
31*03ce13f7SAndroid Build Coastguard Workerstrategy.  Some of these techniques may not be applicable to another compiler,
32*03ce13f7SAndroid Build Coastguard Workerdepending on its particular IR and compilation strategy.  Some key concepts in
33*03ce13f7SAndroid Build Coastguard WorkerSubzero are the following:
34*03ce13f7SAndroid Build Coastguard Worker
35*03ce13f7SAndroid Build Coastguard Worker- Subzero's ``Variable`` operand is an operand that resides either on the stack
36*03ce13f7SAndroid Build Coastguard Worker  or in a physical register.  A Variable can be tagged as *must-have-register*
37*03ce13f7SAndroid Build Coastguard Worker  or *must-not-have-register*, but its default is *may-have-register*.  All uses
38*03ce13f7SAndroid Build Coastguard Worker  of the Variable map to the same physical register or stack location.
39*03ce13f7SAndroid Build Coastguard Worker
40*03ce13f7SAndroid Build Coastguard Worker- Basic lowering is done before register allocation.  Lowering is the process of
41*03ce13f7SAndroid Build Coastguard Worker  translating PNaCl bitcode instructions into native instructions.  Sometimes a
42*03ce13f7SAndroid Build Coastguard Worker  native instruction, like the x86 ``add`` instruction, allows one of its
43*03ce13f7SAndroid Build Coastguard Worker  Variable operands to be either in a physical register or on the stack, in
44*03ce13f7SAndroid Build Coastguard Worker  which case the lowering is relatively simple.  But if the lowered instruction
45*03ce13f7SAndroid Build Coastguard Worker  requires the operand to be in a physical register, we generate code that
46*03ce13f7SAndroid Build Coastguard Worker  copies the Variable into a *must-have-register* temporary, and then use that
47*03ce13f7SAndroid Build Coastguard Worker  temporary in the lowered instruction.
48*03ce13f7SAndroid Build Coastguard Worker
49*03ce13f7SAndroid Build Coastguard Worker- Instructions within a basic block appear in a linearized order (as opposed to
50*03ce13f7SAndroid Build Coastguard Worker  something like a directed acyclic graph of dependency edges).
51*03ce13f7SAndroid Build Coastguard Worker
52*03ce13f7SAndroid Build Coastguard Worker- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually
53*03ce13f7SAndroid Build Coastguard Worker  small) number of *source* Variables.  Assuming SSA form, the instruction
54*03ce13f7SAndroid Build Coastguard Worker  begins the live range of the dest Variable, and may end the live range of one
55*03ce13f7SAndroid Build Coastguard Worker  or more of the source Variables.
56*03ce13f7SAndroid Build Coastguard Worker
57*03ce13f7SAndroid Build Coastguard WorkerExecutive summary
58*03ce13f7SAndroid Build Coastguard Worker-----------------
59*03ce13f7SAndroid Build Coastguard Worker
60*03ce13f7SAndroid Build Coastguard Worker- Liveness analysis and live range construction are prerequisites for register
61*03ce13f7SAndroid Build Coastguard Worker  allocation.  Without careful attention, they can be potentially very
62*03ce13f7SAndroid Build Coastguard Worker  expensive, especially when the number of variables and basic blocks gets very
63*03ce13f7SAndroid Build Coastguard Worker  large.  Subzero uses some clever data structures to take advantage of the
64*03ce13f7SAndroid Build Coastguard Worker  sparsity of the data, resulting in stable performance as function size scales
65*03ce13f7SAndroid Build Coastguard Worker  up.  This means that the proportion of compilation time spent on liveness
66*03ce13f7SAndroid Build Coastguard Worker  analysis stays roughly the same.
67*03ce13f7SAndroid Build Coastguard Worker
68*03ce13f7SAndroid Build Coastguard Worker- The core of Subzero's register allocator is taken from `Mössenböck and
69*03ce13f7SAndroid Build Coastguard Worker  Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on
70*03ce13f7SAndroid Build Coastguard Worker  linear-scan register allocation.
71*03ce13f7SAndroid Build Coastguard Worker
72*03ce13f7SAndroid Build Coastguard Worker- We enhance the core algorithm with a good automatic preference mechanism when
73*03ce13f7SAndroid Build Coastguard Worker  more than one register is available, to try to minimize register shuffling.
74*03ce13f7SAndroid Build Coastguard Worker
75*03ce13f7SAndroid Build Coastguard Worker- We also enhance it to allow overlapping live ranges to share the same
76*03ce13f7SAndroid Build Coastguard Worker  register, when one variable is recognized as a read-only copy of the other.
77*03ce13f7SAndroid Build Coastguard Worker
78*03ce13f7SAndroid Build Coastguard Worker- We deal with register aliasing in a clean and general fashion.  Register
79*03ce13f7SAndroid Build Coastguard Worker  aliasing is when e.g. 16-bit architectural registers share some bits with
80*03ce13f7SAndroid Build Coastguard Worker  their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit
81*03ce13f7SAndroid Build Coastguard Worker  registers.
82*03ce13f7SAndroid Build Coastguard Worker
83*03ce13f7SAndroid Build Coastguard Worker- We improve register allocation by rerunning the algorithm on likely candidates
84*03ce13f7SAndroid Build Coastguard Worker  that didn't get registers during the previous iteration, without imposing much
85*03ce13f7SAndroid Build Coastguard Worker  additional cost.
86*03ce13f7SAndroid Build Coastguard Worker
87*03ce13f7SAndroid Build Coastguard Worker- The main register allocation is done before phi lowering, because phi lowering
88*03ce13f7SAndroid Build Coastguard Worker  imposes early and unnecessary ordering constraints on the resulting
89*03ce13f7SAndroid Build Coastguard Worker  assigments, which create spurious interferences in live ranges.
90*03ce13f7SAndroid Build Coastguard Worker
91*03ce13f7SAndroid Build Coastguard Worker- Within each basic block, we aggressively split each variable's live range at
92*03ce13f7SAndroid Build Coastguard Worker  every use, so that portions of the live range can get registers even if the
93*03ce13f7SAndroid Build Coastguard Worker  whole live range can't.  Doing this separately for each basic block avoids
94*03ce13f7SAndroid Build Coastguard Worker  merge complications, and keeps liveness analysis and register allocation fast
95*03ce13f7SAndroid Build Coastguard Worker  by fitting well into the sparsity optimizations of their data structures.
96*03ce13f7SAndroid Build Coastguard Worker
97*03ce13f7SAndroid Build Coastguard WorkerLiveness analysis
98*03ce13f7SAndroid Build Coastguard Worker-----------------
99*03ce13f7SAndroid Build Coastguard Worker
100*03ce13f7SAndroid Build Coastguard WorkerWith respect to register allocation, the main purpose of liveness analysis is to
101*03ce13f7SAndroid Build Coastguard Workercalculate the live range of each variable.  The live range is represented as a
102*03ce13f7SAndroid Build Coastguard Workerset of instruction number ranges.  Instruction numbers within a basic block must
103*03ce13f7SAndroid Build Coastguard Workerbe monotonically increasing, and the instruction ranges of two different basic
104*03ce13f7SAndroid Build Coastguard Workerblocks must not overlap.
105*03ce13f7SAndroid Build Coastguard Worker
106*03ce13f7SAndroid Build Coastguard WorkerBasic liveness
107*03ce13f7SAndroid Build Coastguard Worker^^^^^^^^^^^^^^
108*03ce13f7SAndroid Build Coastguard Worker
109*03ce13f7SAndroid Build Coastguard WorkerLiveness analysis is a straightforward dataflow algorithm.  For each basic
110*03ce13f7SAndroid Build Coastguard Workerblock, we keep track of the live-in and live-out set, i.e. the set of variables
111*03ce13f7SAndroid Build Coastguard Workerthat are live coming into or going out of the basic block.  Processing of a
112*03ce13f7SAndroid Build Coastguard Workerbasic block starts by initializing a temporary set as the union of the live-in
113*03ce13f7SAndroid Build Coastguard Workersets of the basic block's successor blocks.  (This basic block's live-out set is
114*03ce13f7SAndroid Build Coastguard Workercaptured as the initial value of the temporary set.)  Then each instruction of
115*03ce13f7SAndroid Build Coastguard Workerthe basic block is processed in reverse order.  All the source variables of the
116*03ce13f7SAndroid Build Coastguard Workerinstruction are marked as live, by adding them to the temporary set, and the
117*03ce13f7SAndroid Build Coastguard Workerdest variable of the instruction (if any) is marked as not-live, by removing it
118*03ce13f7SAndroid Build Coastguard Workerfrom the temporary set.  When we finish processing all of the block's
119*03ce13f7SAndroid Build Coastguard Workerinstructions, we add/union the temporary set into the basic block's live-in set.
120*03ce13f7SAndroid Build Coastguard WorkerIf this changes the basic block's live-in set, then we mark all of this basic
121*03ce13f7SAndroid Build Coastguard Workerblock's predecessor blocks to be reprocessed.  Then we repeat for other basic
122*03ce13f7SAndroid Build Coastguard Workerblocks until convergence, i.e. no more basic blocks are marked to be
123*03ce13f7SAndroid Build Coastguard Workerreprocessed.  If basic blocks are processed in reverse topological order, then
124*03ce13f7SAndroid Build Coastguard Workerthe number of times each basic block need to be reprocessed is generally its
125*03ce13f7SAndroid Build Coastguard Workerloop nest depth.
126*03ce13f7SAndroid Build Coastguard Worker
127*03ce13f7SAndroid Build Coastguard WorkerThe result of this pass is the live-in and live-out set for each basic block.
128*03ce13f7SAndroid Build Coastguard Worker
129*03ce13f7SAndroid Build Coastguard WorkerWith so many set operations, choice of data structure is crucial for
130*03ce13f7SAndroid Build Coastguard Workerperformance.  We tried a few options, and found that a simple dense bit vector
131*03ce13f7SAndroid Build Coastguard Workerworks best.  This keeps the per-instruction cost very low.  However, we found
132*03ce13f7SAndroid Build Coastguard Workerthat when the function gets very large, merging and copying bit vectors at basic
133*03ce13f7SAndroid Build Coastguard Workerblock boundaries dominates the cost.  This is due to the number of variables
134*03ce13f7SAndroid Build Coastguard Worker(hence the bit vector size) and the number of basic blocks getting large.
135*03ce13f7SAndroid Build Coastguard Worker
136*03ce13f7SAndroid Build Coastguard WorkerA simple enhancement brought this under control in Subzero.  It turns out that
137*03ce13f7SAndroid Build Coastguard Workerthe vast majority of variables are referenced, and therefore live, only in a
138*03ce13f7SAndroid Build Coastguard Workersingle basic block.  This is largely due to the SSA form of PNaCl bitcode.  To
139*03ce13f7SAndroid Build Coastguard Workertake advantage of this, we partition variables by single-block versus
140*03ce13f7SAndroid Build Coastguard Workermulti-block liveness.  Multi-block variables get lower-numbered bit vector
141*03ce13f7SAndroid Build Coastguard Workerindexes, and single-block variables get higher-number indexes.  Single-block bit
142*03ce13f7SAndroid Build Coastguard Workervector indexes are reused across different basic blocks.  As such, the size of
143*03ce13f7SAndroid Build Coastguard Workerlive-in and live-out bit vectors is limited to the number of multi-block
144*03ce13f7SAndroid Build Coastguard Workervariables, and the temporary set's size can be limited to that plus the largest
145*03ce13f7SAndroid Build Coastguard Workernumber of single-block variables across all basic blocks.
146*03ce13f7SAndroid Build Coastguard Worker
147*03ce13f7SAndroid Build Coastguard WorkerFor the purpose of live range construction, we also need to track definitions
148*03ce13f7SAndroid Build Coastguard Worker(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the
149*03ce13f7SAndroid Build Coastguard Workerbasic block.  These are easy to detect while processing the instructions; data
150*03ce13f7SAndroid Build Coastguard Workerstructure choices are described below.
151*03ce13f7SAndroid Build Coastguard Worker
152*03ce13f7SAndroid Build Coastguard WorkerLive range construction
153*03ce13f7SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^
154*03ce13f7SAndroid Build Coastguard Worker
155*03ce13f7SAndroid Build Coastguard WorkerAfter the live-in and live-out sets are calculated, we construct each variable's
156*03ce13f7SAndroid Build Coastguard Workerlive range (as an ordered list of instruction ranges, described above).  We do
157*03ce13f7SAndroid Build Coastguard Workerthis by considering the live-in and live-out sets, combined with LiveBegin and
158*03ce13f7SAndroid Build Coastguard WorkerLiveEnd information.  This is done separately for each basic block.
159*03ce13f7SAndroid Build Coastguard Worker
160*03ce13f7SAndroid Build Coastguard WorkerAs before, we need to take advantage of sparsity of variable uses across basic
161*03ce13f7SAndroid Build Coastguard Workerblocks, to avoid overly copying/merging data structures.  The following is what
162*03ce13f7SAndroid Build Coastguard Workerworked well for Subzero (after trying several other options).
163*03ce13f7SAndroid Build Coastguard Worker
164*03ce13f7SAndroid Build Coastguard WorkerThe basic liveness pass, described above, keeps track of when a variable's live
165*03ce13f7SAndroid Build Coastguard Workerrange begins or ends within the block.  LiveBegin and LiveEnd are unordered
166*03ce13f7SAndroid Build Coastguard Workervectors where each element is a pair of the variable and the instruction number,
167*03ce13f7SAndroid Build Coastguard Workerrepresenting that the particular variable's live range begins or ends at the
168*03ce13f7SAndroid Build Coastguard Workerparticular instruction.  When the liveness pass finds a variable whose live
169*03ce13f7SAndroid Build Coastguard Workerrange begins or ends, it appends and entry to LiveBegin or LiveEnd.
170*03ce13f7SAndroid Build Coastguard Worker
171*03ce13f7SAndroid Build Coastguard WorkerDuring live range construction, the LiveBegin and LiveEnd vectors are sorted by
172*03ce13f7SAndroid Build Coastguard Workervariable number.  Then we iterate across both vectors in parallel.  If a
173*03ce13f7SAndroid Build Coastguard Workervariable appears in both LiveBegin and LiveEnd, then its live range is entirely
174*03ce13f7SAndroid Build Coastguard Workerwithin this block.  If it appears in only LiveBegin, then its live range starts
175*03ce13f7SAndroid Build Coastguard Workerhere and extends through the end of the block.  If it appears in only LiveEnd,
176*03ce13f7SAndroid Build Coastguard Workerthen its live range starts at the beginning of the block and ends here.  (Note
177*03ce13f7SAndroid Build Coastguard Workerthat this only covers the live range within this block, and this process is
178*03ce13f7SAndroid Build Coastguard Workerrepeated across all blocks.)
179*03ce13f7SAndroid Build Coastguard Worker
180*03ce13f7SAndroid Build Coastguard WorkerIt is also possible that a variable is live within this block but its live range
181*03ce13f7SAndroid Build Coastguard Workerdoes not begin or end in this block.  These variables are identified simply by
182*03ce13f7SAndroid Build Coastguard Workertaking the intersection of the live-in and live-out sets.
183*03ce13f7SAndroid Build Coastguard Worker
184*03ce13f7SAndroid Build Coastguard WorkerAs a result of these data structures, performance of liveness analysis and live
185*03ce13f7SAndroid Build Coastguard Workerrange construction tend to be stable across small, medium, and large functions,
186*03ce13f7SAndroid Build Coastguard Workeras measured by a fairly consistent proportion of total compilation time spent on
187*03ce13f7SAndroid Build Coastguard Workerthe liveness passes.
188*03ce13f7SAndroid Build Coastguard Worker
189*03ce13f7SAndroid Build Coastguard WorkerLinear-scan register allocation
190*03ce13f7SAndroid Build Coastguard Worker-------------------------------
191*03ce13f7SAndroid Build Coastguard Worker
192*03ce13f7SAndroid Build Coastguard WorkerThe basis of Subzero's register allocator is the allocator described by
193*03ce13f7SAndroid Build Coastguard WorkerHanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in
194*03ce13f7SAndroid Build Coastguard Workerthe Context of SSA Form and Register Constraints
195*03ce13f7SAndroid Build Coastguard Worker<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  It allows live ranges to
196*03ce13f7SAndroid Build Coastguard Workerbe a union of intervals rather than a single conservative interval, and it
197*03ce13f7SAndroid Build Coastguard Workerallows pre-coloring of variables with specific physical registers.
198*03ce13f7SAndroid Build Coastguard Worker
199*03ce13f7SAndroid Build Coastguard WorkerThe paper suggests an approach of aggressively coalescing variables across Phi
200*03ce13f7SAndroid Build Coastguard Workerinstructions (i.e., trying to force Phi source and dest variables to have the
201*03ce13f7SAndroid Build Coastguard Workersame register assignment), but we omit that in favor of the more natural
202*03ce13f7SAndroid Build Coastguard Workerpreference mechanism described below.
203*03ce13f7SAndroid Build Coastguard Worker
204*03ce13f7SAndroid Build Coastguard WorkerWe found the paper quite remarkable in that a straightforward implementation of
205*03ce13f7SAndroid Build Coastguard Workerits pseudo-code led to an entirely correct register allocator.  The only thing
206*03ce13f7SAndroid Build Coastguard Workerwe found in the specification that was even close to a mistake is that it was
207*03ce13f7SAndroid Build Coastguard Workertoo aggressive in evicting inactive ranges in the "else" clause of the
208*03ce13f7SAndroid Build Coastguard WorkerAssignMemLoc routine.  An inactive range only needs to be evicted if it actually
209*03ce13f7SAndroid Build Coastguard Workeroverlaps the current range being considered, whereas the paper evicts it
210*03ce13f7SAndroid Build Coastguard Workerunconditionally.  (Search for "original paper" in Subzero's register allocator
211*03ce13f7SAndroid Build Coastguard Workersource code.)
212*03ce13f7SAndroid Build Coastguard Worker
213*03ce13f7SAndroid Build Coastguard WorkerRegister preference
214*03ce13f7SAndroid Build Coastguard Worker-------------------
215*03ce13f7SAndroid Build Coastguard Worker
216*03ce13f7SAndroid Build Coastguard WorkerThe linear-scan algorithm from the paper talks about choosing an available
217*03ce13f7SAndroid Build Coastguard Workerregister, but isn't specific on how to choose among several available registers.
218*03ce13f7SAndroid Build Coastguard WorkerThe simplest approach is to just choose the first available register, e.g. the
219*03ce13f7SAndroid Build Coastguard Workerlowest-numbered register.  Often a better choice is possible.
220*03ce13f7SAndroid Build Coastguard Worker
221*03ce13f7SAndroid Build Coastguard WorkerSpecifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)``
222*03ce13f7SAndroid Build Coastguard Workerwith source variables ``S1,S2,...``, and that instruction ends the live range of
223*03ce13f7SAndroid Build Coastguard Workerone of those source variables ``Sn``, and ``Sn`` was assigned a register, then
224*03ce13f7SAndroid Build Coastguard Worker``Sn``'s register is usually a good choice for ``V``.  This is especially true
225*03ce13f7SAndroid Build Coastguard Workerwhen the instruction is a simple assignment, because an assignment where the
226*03ce13f7SAndroid Build Coastguard Workerdest and source variables end up with the same register can be trivially elided,
227*03ce13f7SAndroid Build Coastguard Workerreducing the amount of register-shuffling code.
228*03ce13f7SAndroid Build Coastguard Worker
229*03ce13f7SAndroid Build Coastguard WorkerThis requires being able to find and inspect a variable's defining instruction,
230*03ce13f7SAndroid Build Coastguard Workerwhich is not an assumption in the original paper but is easily tracked in
231*03ce13f7SAndroid Build Coastguard Workerpractice.
232*03ce13f7SAndroid Build Coastguard Worker
233*03ce13f7SAndroid Build Coastguard WorkerRegister overlap
234*03ce13f7SAndroid Build Coastguard Worker----------------
235*03ce13f7SAndroid Build Coastguard Worker
236*03ce13f7SAndroid Build Coastguard WorkerBecause Subzero does register allocation after basic lowering, the lowering has
237*03ce13f7SAndroid Build Coastguard Workerto be prepared for the possibility of any given program variable not getting a
238*03ce13f7SAndroid Build Coastguard Workerphysical register.  It does this by introducing *must-have-register* temporary
239*03ce13f7SAndroid Build Coastguard Workervariables, and copies the program variable into the temporary to ensure that
240*03ce13f7SAndroid Build Coastguard Workerregister requirements in the target instruction are met.
241*03ce13f7SAndroid Build Coastguard Worker
242*03ce13f7SAndroid Build Coastguard WorkerIn many cases, the program variable and temporary variable have overlapping live
243*03ce13f7SAndroid Build Coastguard Workerranges, and would be forced to have different registers even if the temporary
244*03ce13f7SAndroid Build Coastguard Workervariable is effectively a read-only copy of the program variable.  We recognize
245*03ce13f7SAndroid Build Coastguard Workerthis when the program variable has no definitions within the temporary
246*03ce13f7SAndroid Build Coastguard Workervariable's live range, and the temporary variable has no definitions within the
247*03ce13f7SAndroid Build Coastguard Workerprogram variable's live range with the exception of the copy assignment.
248*03ce13f7SAndroid Build Coastguard Worker
249*03ce13f7SAndroid Build Coastguard WorkerThis analysis is done as part of register preference detection.
250*03ce13f7SAndroid Build Coastguard Worker
251*03ce13f7SAndroid Build Coastguard WorkerThe main impact on the linear-scan implementation is that instead of
252*03ce13f7SAndroid Build Coastguard Workersetting/resetting a boolean flag to indicate whether a register is free or in
253*03ce13f7SAndroid Build Coastguard Workeruse, we increment/decrement a number-of-uses counter.
254*03ce13f7SAndroid Build Coastguard Worker
255*03ce13f7SAndroid Build Coastguard WorkerRegister aliases
256*03ce13f7SAndroid Build Coastguard Worker----------------
257*03ce13f7SAndroid Build Coastguard Worker
258*03ce13f7SAndroid Build Coastguard WorkerSometimes registers of different register classes partially overlap.  For
259*03ce13f7SAndroid Build Coastguard Workerexample, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't
260*03ce13f7SAndroid Build Coastguard Workeralias each other), and all three alias ``eax`` and ``rax``.  And in ARM,
261*03ce13f7SAndroid Build Coastguard Workerregisters ``s0`` and ``s1`` (which are single-precision floating-point
262*03ce13f7SAndroid Build Coastguard Workerregisters) alias ``d0`` (double-precision floating-point), and registers ``d0``
263*03ce13f7SAndroid Build Coastguard Workerand ``d1`` alias ``q0`` (128-bit vector register).  The linear-scan paper
264*03ce13f7SAndroid Build Coastguard Workerdoesn't address this issue; it assumes that all registers are independent.  A
265*03ce13f7SAndroid Build Coastguard Workersimple solution is to essentially avoid aliasing by disallowing a subset of the
266*03ce13f7SAndroid Build Coastguard Workerregisters, but there is obviously a reduction in code quality when e.g. half of
267*03ce13f7SAndroid Build Coastguard Workerthe registers are taken away.
268*03ce13f7SAndroid Build Coastguard Worker
269*03ce13f7SAndroid Build Coastguard WorkerSubzero handles this more elegantly.  For each register, we keep a bitmask
270*03ce13f7SAndroid Build Coastguard Workerindicating which registers alias/conflict with it.  For example, in x86,
271*03ce13f7SAndroid Build Coastguard Worker``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and
272*03ce13f7SAndroid Build Coastguard Workerin ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``.  Whenever we want to
273*03ce13f7SAndroid Build Coastguard Workermark a register as being used or released, we do the same for all of its
274*03ce13f7SAndroid Build Coastguard Workeraliases.
275*03ce13f7SAndroid Build Coastguard Worker
276*03ce13f7SAndroid Build Coastguard WorkerBefore implementing this, we were a bit apprehensive about the compile-time
277*03ce13f7SAndroid Build Coastguard Workerperformance impact.  Instead of setting one bit in a bit vector or decrementing
278*03ce13f7SAndroid Build Coastguard Workerone counter, this generally needs to be done in a loop that iterates over all
279*03ce13f7SAndroid Build Coastguard Workeraliases.  Fortunately, this seemed to make very little difference in
280*03ce13f7SAndroid Build Coastguard Workerperformance, as the bulk of the cost ends up being in live range overlap
281*03ce13f7SAndroid Build Coastguard Workercomputations, which are not affected by register aliasing.
282*03ce13f7SAndroid Build Coastguard Worker
283*03ce13f7SAndroid Build Coastguard WorkerRepeat until convergence
284*03ce13f7SAndroid Build Coastguard Worker------------------------
285*03ce13f7SAndroid Build Coastguard Worker
286*03ce13f7SAndroid Build Coastguard WorkerSometimes the linear-scan algorithm makes a register assignment only to later
287*03ce13f7SAndroid Build Coastguard Workerrevoke it in favor of a higher-priority variable, but it turns out that a
288*03ce13f7SAndroid Build Coastguard Workerdifferent initial register choice would not have been revoked.  For relatively
289*03ce13f7SAndroid Build Coastguard Workerlow compile-time cost, we can give those variables another chance.
290*03ce13f7SAndroid Build Coastguard Worker
291*03ce13f7SAndroid Build Coastguard WorkerDuring register allocation, we keep track of the revoked variables and then do
292*03ce13f7SAndroid Build Coastguard Workeranother round of register allocation targeted only to that set.  We repeat until
293*03ce13f7SAndroid Build Coastguard Workerno new register assignments are made, which is usually just a handful of
294*03ce13f7SAndroid Build Coastguard Workersuccessively cheaper iterations.
295*03ce13f7SAndroid Build Coastguard Worker
296*03ce13f7SAndroid Build Coastguard WorkerAnother approach would be to repeat register allocation for *all* variables that
297*03ce13f7SAndroid Build Coastguard Workerhaven't had a register assigned (rather than variables that got a register that
298*03ce13f7SAndroid Build Coastguard Workerwas subsequently revoked), but our experience is that this greatly increases
299*03ce13f7SAndroid Build Coastguard Workercompile-time cost, with little or no code quality gain.
300*03ce13f7SAndroid Build Coastguard Worker
301*03ce13f7SAndroid Build Coastguard WorkerDelayed Phi lowering
302*03ce13f7SAndroid Build Coastguard Worker--------------------
303*03ce13f7SAndroid Build Coastguard Worker
304*03ce13f7SAndroid Build Coastguard WorkerThe linear-scan algorithm works for phi instructions as well as regular
305*03ce13f7SAndroid Build Coastguard Workerinstructions, but it is tempting to lower phi instructions into non-SSA
306*03ce13f7SAndroid Build Coastguard Workerassignments before register allocation, so that register allocation need only
307*03ce13f7SAndroid Build Coastguard Workerhappen once.
308*03ce13f7SAndroid Build Coastguard Worker
309*03ce13f7SAndroid Build Coastguard WorkerUnfortunately, simple phi lowering imposes an arbitrary ordering on the
310*03ce13f7SAndroid Build Coastguard Workerresulting assignments that can cause artificial overlap/interference between
311*03ce13f7SAndroid Build Coastguard Workerlowered assignments, and can lead to worse register allocation decisions.  As a
312*03ce13f7SAndroid Build Coastguard Workersimple example, consider these two phi instructions which are semantically
313*03ce13f7SAndroid Build Coastguard Workerunordered::
314*03ce13f7SAndroid Build Coastguard Worker
315*03ce13f7SAndroid Build Coastguard Worker  A = phi(B) // B's live range ends here
316*03ce13f7SAndroid Build Coastguard Worker  C = phi(D) // D's live range ends here
317*03ce13f7SAndroid Build Coastguard Worker
318*03ce13f7SAndroid Build Coastguard WorkerA straightforward lowering might yield::
319*03ce13f7SAndroid Build Coastguard Worker
320*03ce13f7SAndroid Build Coastguard Worker  A = B // end of B's live range
321*03ce13f7SAndroid Build Coastguard Worker  C = D // end of D's live range
322*03ce13f7SAndroid Build Coastguard Worker
323*03ce13f7SAndroid Build Coastguard WorkerThe potential problem here is that A and D's live ranges overlap, and so they
324*03ce13f7SAndroid Build Coastguard Workerare prevented from having the same register.  Swapping the order of lowered
325*03ce13f7SAndroid Build Coastguard Workerassignments fixes that (but then B and C would overlap), but we can't really
326*03ce13f7SAndroid Build Coastguard Workerknow which is better until after register allocation.
327*03ce13f7SAndroid Build Coastguard Worker
328*03ce13f7SAndroid Build Coastguard WorkerSubzero deals with this by doing the main register allocation before phi
329*03ce13f7SAndroid Build Coastguard Workerlowering, followed by phi lowering, and finally a special register allocation
330*03ce13f7SAndroid Build Coastguard Workerpass limited to the new lowered assignments.
331*03ce13f7SAndroid Build Coastguard Worker
332*03ce13f7SAndroid Build Coastguard WorkerPhi lowering considers the phi operands separately for each predecessor edge,
333*03ce13f7SAndroid Build Coastguard Workerand starts by finding a topological sort of the Phi instructions, such that
334*03ce13f7SAndroid Build Coastguard Workerassignments can be executed in that order without violating dependencies on
335*03ce13f7SAndroid Build Coastguard Workerregisters or stack locations.  If a topological sort is not possible due to a
336*03ce13f7SAndroid Build Coastguard Workercycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can
337*03ce13f7SAndroid Build Coastguard Workerbecome ``T=A;A=B;B=C;C=T``.  The topological order is tuned to favor freeing up
338*03ce13f7SAndroid Build Coastguard Workerregisters early to reduce register pressure.
339*03ce13f7SAndroid Build Coastguard Worker
340*03ce13f7SAndroid Build Coastguard WorkerIt then lowers the linearized assignments into machine instructions (which may
341*03ce13f7SAndroid Build Coastguard Workerrequire extra physical registers e.g. to copy from one stack location to
342*03ce13f7SAndroid Build Coastguard Workeranother), and finally runs the register allocator limited to those instructions.
343*03ce13f7SAndroid Build Coastguard Worker
344*03ce13f7SAndroid Build Coastguard WorkerIn rare cases, the register allocator may fail on some *must-have-register*
345*03ce13f7SAndroid Build Coastguard Workervariable, because register pressure is too high to satisfy requirements arising
346*03ce13f7SAndroid Build Coastguard Workerfrom cycle-breaking temporaries and registers required for stack-to-stack
347*03ce13f7SAndroid Build Coastguard Workercopies.  In this case, we have to find a register with no active uses within the
348*03ce13f7SAndroid Build Coastguard Workervariable's live range, and actively spill/restore that register around the live
349*03ce13f7SAndroid Build Coastguard Workerrange.  This makes the code quality suffer and may be slow to implement
350*03ce13f7SAndroid Build Coastguard Workerdepending on compiler data structures, but in practice we find the situation to
351*03ce13f7SAndroid Build Coastguard Workerbe vanishingly rare and so not really worth optimizing.
352*03ce13f7SAndroid Build Coastguard Worker
353*03ce13f7SAndroid Build Coastguard WorkerLocal live range splitting
354*03ce13f7SAndroid Build Coastguard Worker--------------------------
355*03ce13f7SAndroid Build Coastguard Worker
356*03ce13f7SAndroid Build Coastguard WorkerThe basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets
357*03ce13f7SAndroid Build Coastguard Workera register for its entire live range, or not at all.  This is unfortunate when a
358*03ce13f7SAndroid Build Coastguard Workervariable has many uses close together, but ultimately a large enough live range
359*03ce13f7SAndroid Build Coastguard Workerto prevent register assignment.  Another bad example is on x86 where all vector
360*03ce13f7SAndroid Build Coastguard Workerand floating-point registers (the ``xmm`` registers) are killed by call
361*03ce13f7SAndroid Build Coastguard Workerinstructions, per the x86 call ABI, so such variables are completely prevented
362*03ce13f7SAndroid Build Coastguard Workerfrom having a register when their live ranges contain a call instruction.
363*03ce13f7SAndroid Build Coastguard Worker
364*03ce13f7SAndroid Build Coastguard WorkerThe general solution here is some policy for splitting live ranges.  A variable
365*03ce13f7SAndroid Build Coastguard Workercan be split into multiple copies and each can be register-allocated separately.
366*03ce13f7SAndroid Build Coastguard WorkerThe complication comes in finding a sane policy for where and when to split
367*03ce13f7SAndroid Build Coastguard Workervariables such that complexity doesn't explode, and how to join the different
368*03ce13f7SAndroid Build Coastguard Workervalues at merge points.
369*03ce13f7SAndroid Build Coastguard Worker
370*03ce13f7SAndroid Build Coastguard WorkerSubzero implements aggressive block-local splitting of variables.  Each basic
371*03ce13f7SAndroid Build Coastguard Workerblock is handled separately and independently.  Within the block, we maintain a
372*03ce13f7SAndroid Build Coastguard Workertable ``T`` that maps each variable ``V`` to its split version ``T[V]``, with
373*03ce13f7SAndroid Build Coastguard Workerevery variable ``V``'s initial state set (implicitly) as ``T[V]=V``.  For each
374*03ce13f7SAndroid Build Coastguard Workerinstruction in the block, and for each *may-have-register* variable ``V`` in the
375*03ce13f7SAndroid Build Coastguard Workerinstruction, we do the following:
376*03ce13f7SAndroid Build Coastguard Worker
377*03ce13f7SAndroid Build Coastguard Worker- Replace all uses of ``V`` in the instruction with ``T[V]``.
378*03ce13f7SAndroid Build Coastguard Worker
379*03ce13f7SAndroid Build Coastguard Worker- Create a new split variable ``V'``.
380*03ce13f7SAndroid Build Coastguard Worker
381*03ce13f7SAndroid Build Coastguard Worker- Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately
382*03ce13f7SAndroid Build Coastguard Worker  before or immediately after) the current instruction.
383*03ce13f7SAndroid Build Coastguard Worker
384*03ce13f7SAndroid Build Coastguard Worker- Update ``T[V]`` to be ``V'``.
385*03ce13f7SAndroid Build Coastguard Worker
386*03ce13f7SAndroid Build Coastguard WorkerThis leads to a chain of copies of ``V`` through the block, linked by assignment
387*03ce13f7SAndroid Build Coastguard Workerinstructions.  The live ranges of these copies are usually much smaller, and
388*03ce13f7SAndroid Build Coastguard Workermore likely to get registers.  In fact, because of the preference mechanism
389*03ce13f7SAndroid Build Coastguard Workerdescribed above, they are likely to get the same register whenever possible.
390*03ce13f7SAndroid Build Coastguard Worker
391*03ce13f7SAndroid Build Coastguard WorkerOne obvious question comes up: won't this proliferation of new variables cause
392*03ce13f7SAndroid Build Coastguard Workeran explosion in the running time of liveness analysis and register allocation?
393*03ce13f7SAndroid Build Coastguard WorkerAs it turns out, not really.
394*03ce13f7SAndroid Build Coastguard Worker
395*03ce13f7SAndroid Build Coastguard WorkerFirst, for register allocation, the cost tends to be dominated by live range
396*03ce13f7SAndroid Build Coastguard Workeroverlap computations, whose cost is roughly proportional to the size of the live
397*03ce13f7SAndroid Build Coastguard Workerrange.  All the new variable copies' live ranges sum up to the original
398*03ce13f7SAndroid Build Coastguard Workervariable's live range, so the cost isn't vastly greater.
399*03ce13f7SAndroid Build Coastguard Worker
400*03ce13f7SAndroid Build Coastguard WorkerSecond, for liveness analysis, the cost is dominated by merging bit vectors
401*03ce13f7SAndroid Build Coastguard Workercorresponding to the set of variables that have multi-block liveness.  All the
402*03ce13f7SAndroid Build Coastguard Workernew copies are guaranteed to be single-block, so the main additional cost is
403*03ce13f7SAndroid Build Coastguard Workerthat of processing the new assignments.
404*03ce13f7SAndroid Build Coastguard Worker
405*03ce13f7SAndroid Build Coastguard WorkerThere's one other key issue here.  The original variable and all of its copies
406*03ce13f7SAndroid Build Coastguard Workerneed to be "linked", in the sense that all of these variables that get a stack
407*03ce13f7SAndroid Build Coastguard Workerslot (because they didn't get a register) are guaranteed to have the same stack
408*03ce13f7SAndroid Build Coastguard Workerslot.  This way, we can avoid generating any code related to ``V'=V`` when
409*03ce13f7SAndroid Build Coastguard Workerneither gets a register.  In addition, we can elide instructions that write a
410*03ce13f7SAndroid Build Coastguard Workervalue to a stack slot that is linked to another stack slot, because that is
411*03ce13f7SAndroid Build Coastguard Workerguaranteed to be just rewriting the same value to the stack.
412