1Design of the Subzero fast code generator
2=========================================
3
4Introduction
5------------
6
7The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
8compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
9PNaCl toolchain to compile their application to architecture-neutral PNaCl
10bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
11possible.  The ``.pexe`` file is downloaded to the user's browser where the
12PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
13`sandboxed
14<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
15native code.  The translator uses architecture-specific optimizations as much as
16practical to generate good native code.
17
18The native code can be cached by the browser to avoid repeating translation on
19future page loads.  However, first-time user experience is hampered by long
20translation times.  The LLVM-based PNaCl translator is pretty slow, even when
21using ``-O0`` to minimize optimizations, so delays are especially noticeable on
22slow browser platforms such as ARM-based Chromebooks.
23
24Translator slowness can be mitigated or hidden in a number of ways.
25
26- Parallel translation.  However, slow machines where this matters most, e.g.
27  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.
30
31- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or when the ``.pexe`` file is already cached but the translated binary
  has been flushed from the cache.
35
36- Arrange the web page such that translation is done in parallel with
37  downloading large assets.
38
39- Arrange the web page to distract the user with `cat videos
40  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
41  progress.
42
43Or, improve translator performance to something more reasonable.
44
45This document describes Subzero's attempt to improve translation speed by an
46order of magnitude while rivaling LLVM's code quality.  Subzero does this
47through minimal IR layering, lean data structures and passes, and a careful
48selection of fast optimization passes.  It has two optimization recipes: full
49optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
50following (described in more detail below):
51
52+----------------------------------------+-----------------------------+
53| O2 recipe                              | Om1 recipe                  |
54+========================================+=============================+
55| Parse .pexe file                       | Parse .pexe file            |
56+----------------------------------------+-----------------------------+
57| Loop nest analysis                     |                             |
58+----------------------------------------+-----------------------------+
59| Local common subexpression elimination |                             |
60+----------------------------------------+-----------------------------+
61| Address mode inference                 |                             |
62+----------------------------------------+-----------------------------+
63| Read-modify-write (RMW) transform      |                             |
64+----------------------------------------+-----------------------------+
65| Basic liveness analysis                |                             |
66+----------------------------------------+-----------------------------+
67| Load optimization                      |                             |
68+----------------------------------------+-----------------------------+
69|                                        | Phi lowering (simple)       |
70+----------------------------------------+-----------------------------+
71| Target lowering                        | Target lowering             |
72+----------------------------------------+-----------------------------+
73| Full liveness analysis                 |                             |
74+----------------------------------------+-----------------------------+
75| Register allocation                    | Minimal register allocation |
76+----------------------------------------+-----------------------------+
77| Phi lowering (advanced)                |                             |
78+----------------------------------------+-----------------------------+
79| Post-phi register allocation           |                             |
80+----------------------------------------+-----------------------------+
81| Branch optimization                    |                             |
82+----------------------------------------+-----------------------------+
83| Code emission                          | Code emission               |
84+----------------------------------------+-----------------------------+
85
86Goals
87=====
88
89Translation speed
90-----------------
91
92We'd like to be able to translate a ``.pexe`` file as fast as download speed.
93Any faster is in a sense wasted effort.  Download speed varies greatly, but
94we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
95slow machine.  We observe a 3× single-thread performance difference between A15
96and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
97``.pexe`` file could be compressed to 50% on the web server using gzip transport
98compression, so we set the translation speed goal to 6 MB/sec on the high-end
99Xeon workstation.
100
Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
102⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
103
104In other words, Subzero's goal is to improve over LLVM's translation speed by
10510×.
106
107Code quality
108------------
109
110Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
111code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
112average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
113
114It's important to note that the quality of Subzero-generated code depends on
115target-neutral optimizations and simplifications being run beforehand in the
116developer environment.  The ``.pexe`` file reflects these optimizations.  For
117example, Subzero assumes that the basic blocks are ordered topologically where
118possible (which makes liveness analysis converge fastest), and Subzero does not
119do any function inlining because it should already have been done.
120
121Translator size
122---------------
123
124The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for a component that
126is distributed to a billion Chrome users.  Thus we target a 10× reduction in
127binary size.
128
For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled.  For a smaller translator, we restrict
131to a single target architecture, and define a ``MINIMAL`` build where
132unnecessary features are compiled out.
133
134Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
135include directories, which have little impact on translator size.  It also uses
136some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
137with little size impact.  In non-``MINIMAL`` builds, the translator size is much
138larger due to including code for parsing text-format bitcode files and forming
139LLVM IR.
140
141Memory footprint
142----------------
143
144The current LLVM-based translator suffers from an issue in which some
145function-specific data has to be retained in memory until all translation
146completes, and therefore the memory footprint grows without bound.  Large
147``.pexe`` files can lead to the translator process holding hundreds of MB of
148memory by the end.  The translator runs in a separate process, so this memory
149growth doesn't *directly* affect other processes, but it does dirty the physical
150memory and contributes to a perception of bloat and sometimes a reality of
151out-of-memory tab killing, especially noticeable on weaker systems.
152
153Subzero should maintain a stable memory footprint throughout translation.  It's
154not really practical to set a specific limit, because there is not really a
155practical limit on a single function's size, but the footprint should be
156"reasonable" and be proportional to the largest input function size, not the
157total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
158inexorable memory growth.  (We use ASAN builds to test for leaks.)
159
160Multithreaded translation
161-------------------------
162
163It should be practical to translate different functions concurrently and see
164good scalability.  Some locking may be needed, such as accessing output buffers
165or constant pools, but that should be fairly minimal.  In contrast, LLVM was
166only designed for module-level parallelism, and as such, the PNaCl translator
167internally splits a ``.pexe`` file into several modules for concurrent
168translation.  All output needs to be deterministic regardless of the level of
169multithreading, i.e. functions and data should always be output in the same
170order.
171
172Target architectures
173--------------------
174
175Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model and a validator.
178
179The first implementation is for x86-32, because it was expected to be
180particularly challenging, and thus more likely to draw out any design problems
181early:
182
183- There are a number of special cases, asymmetries, and warts in the x86
184  instruction set.
185
186- Complex addressing modes may be leveraged for better code quality.
187
188- 64-bit integer operations have to be lowered into longer sequences of 32-bit
189  operations.
190
191- Paucity of physical registers may reveal code quality issues early in the
192  design.
193
194Detailed design
195===============
196
197Intermediate representation - ICE
198---------------------------------
199
200Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
201IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
202representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
204list of instructions.  Instructions that operate on stack and register variables
205do so using static single assignment (SSA) form.
206
207The ``.pexe`` file is translated one function at a time (or in parallel by
208multiple translation threads).  The recipe for optimization passes depends on
209the specific target and optimization level, and is described in detail below.
210Global variables (types and initializers) are simply and directly translated to
211object code, without any meaningful attempts at optimization.
212
213A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
214Its key contents include:
215
216- A list of ``CfgNode`` pointers, generally held in topological order.
217
218- A list of ``Variable`` pointers corresponding to local variables used in the
219  function plus compiler-generated temporaries.
220
221A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
222include:
223
224- A linear list of instructions, in the same style as LLVM.  The last
225  instruction of the list is always a terminator instruction: branch, switch,
226  return, unreachable.
227
228- A list of Phi instructions, also in the same style as LLVM.  They are held as
229  a linear list for convenience, though per Phi semantics, they are executed "in
230  parallel" without dependencies on each other.
231
232- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
233  another list for outgoing edges.
234
235- The node's unique, 0-based index into the CFG's node list.
236
237An instruction is represented by the ``Ice::Inst`` class.  Its key contents
238include:
239
240- A list of source operands.
241
242- Its destination variable, if the instruction produces a result in an
243  ``Ice::Variable``.
244
245- A bitvector indicating which variables' live ranges this instruction ends.
246  This is computed during liveness analysis.
247
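
A rough, simplified sketch of how the pieces described in the three lists above
fit together is shown below; the member names and types are illustrative only
and do not exactly match the real ICE headers::

    #include <cstdint>
    #include <vector>

    using SizeT = uint32_t;
    class Operand {};
    class Variable;                       // a kind of Operand in real ICE

    struct Inst {
      Variable *Dest = nullptr;           // null if no result is produced
      std::vector<Operand *> Srcs;        // source operands
      std::vector<bool> LiveRangesEnded;  // which variables' live ranges end here
    };

    struct CfgNode {                      // one basic block
      std::vector<Inst *> Phis;           // phi assignments, semantically parallel
      std::vector<Inst *> Insts;          // linear list; last one is a terminator
      std::vector<CfgNode *> InEdges;     // unordered predecessors
      std::vector<CfgNode *> OutEdges;    // unordered successors
      SizeT Number = 0;                   // 0-based index into the CFG's node list
    };

    struct Cfg {                          // one function
      std::vector<CfgNode *> Nodes;       // generally in topological order
      std::vector<Variable *> Variables;  // locals plus compiler temporaries
    };
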
Instruction kinds are divided into high-level ICE instructions and low-level
249ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
250instruction kinds.  Each target architecture implementation extends the
251instruction space with its own set of low-level instructions.  Generally,
252low-level instructions correspond to individual machine instructions.  The
253high-level ICE instruction space includes a few additional instruction kinds
254that are not part of LLVM but are generally useful (e.g., an Assignment
255instruction), or are useful across targets.
256
257Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
258ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
259<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
260the following:
261
262- Alloca: allocate data on the stack
263
264- Arithmetic: binary operations of the form ``A = B op C``
265
266- Br: conditional or unconditional branch
267
268- Call: function call
269
270- Cast: unary type-conversion operations
271
272- ExtractElement: extract a scalar element from a vector-type value
273
274- Fcmp: floating-point comparison
275
276- Icmp: integer comparison
277
278- Intrinsic: invoke a known intrinsic
279
280- InsertElement: insert a scalar element into a vector-type value
281
282- Load: load a value from memory
283
284- Phi: implement the SSA phi node
285
286- Ret: return from the function
287
288- Select: essentially the C language operation of the form ``X = C ? Y : Z``
289
290- Store: store a value into memory
291
292- Switch: generalized branch to multiple possible locations
293
294- Unreachable: indicate that this portion of the code is unreachable
295
296The additional high-level ICE instructions are the following:
297
298- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
299  instructions to non-SSA assignments, before lowering to machine code.
300
301- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
302  liveness analysis, elevated to the high-level because they are used by all
303  targets.  They are described in more detail at the end of this section.
304
305- JumpTable: this represents the result of switch optimization analysis, where
306  some switch instructions may use jump tables instead of cascading
307  compare/branches.
308
309An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
310operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
311include scalar integer constants, scalar floating point constants, Undef (an
312unspecified constant of a particular scalar or vector type), and symbol
313constants (essentially addresses of globals).  Note that the PNaCl ABI does not
314include vector-type constants besides Undef, and as such, Subzero (so far) has
315no reason to represent vector-type constants internally.  A variable represents
316a value allocated on the stack (though not including alloca-derived storage).
317Among other things, a variable holds its unique, 0-based index into the CFG's
318variable list.
319
320Each target can extend the ``Constant`` and ``Variable`` classes for its own
321needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
322x86 ``MemOperand`` that encodes a base register, an index register, an index
323register shift amount, and a constant offset.
324
325Register allocation and liveness analysis are restricted to Variable operands.
326Because of the importance of register allocation to code quality, and the
327translation-time cost of liveness analysis, Variable operands get some special
328treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
329across all the Variables of an instruction.  An instruction holds a list of
330operands, but an operand may contain 0, 1, or more Variables.  As such, the
331``Operand`` class specially holds a list of Variables contained within, for
332quick access.
333
334A Subzero transformation pass may work by deleting an existing instruction and
335replacing it with zero or more new instructions.  Instead of actually deleting
336the existing instruction, we generally mark it as deleted and insert the new
337instructions right after the deleted instruction.  When printing the IR for
338debugging, this is a big help because it makes it much more clear how the
339non-deleted instructions came about.
340
341Subzero has a few special instructions to help with liveness analysis
342consistency.
343
344- The FakeDef instruction gives a fake definition of some variable.  For
345  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
346  but an ICE instruction can represent only one destination variable.  This is
347  similar for multiply instructions, and for function calls that return a 64-bit
348  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
349  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
350
351- The FakeUse instruction registers a use of a variable, typically to prevent an
352  earlier assignment to that variable from being dead-code eliminated.  For
353  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
354  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
355  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
356  dead-code eliminate the first instruction.
357
358- The FakeKill instruction is added after a call instruction, and is a quick way
359  of indicating that caller-save registers are invalidated.
360
361Pexe parsing
362------------
363
364Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
365parses the ``.pexe`` file function by function, ultimately constructing an ICE
366CFG for each function.  After a function is parsed, its CFG is handed off to the
367translation phase.  The bitcode parser also parses global initializer data and
368hands it off to be translated to data sections in the object file.
369
370Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
371be used to parse a module into LLVM IR (though very slowly relative to Subzero
372native parsing).  Then we iterate across the LLVM IR and construct high-level
373ICE, handing off each CFG to the translation phase.
374
375Overview of lowering
376--------------------
377
378In general, translation goes like this:
379
380- Parse the next function from the ``.pexe`` file and construct a CFG consisting
381  of high-level ICE.
382
383- Do analysis passes and transformation passes on the high-level ICE, as
384  desired.
385
386- Lower each high-level ICE instruction into a sequence of zero or more
387  low-level ICE instructions.  Each high-level instruction is generally lowered
388  independently, though the target lowering is allowed to look ahead in the
389  CfgNode's instruction list if desired.
390
391- Do more analysis and transformation passes on the low-level ICE, as desired.
392
393- Assemble the low-level CFG into an ELF object file (alternatively, a textual
394  assembly file that is later assembled by some external tool).
395
396- Repeat for all functions, and also produce object code for data such as global
397  initializers and internal constant pools.
398
399Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
400the intention is to apply all available optimizations to get the best code
401quality (though the initial code quality goal is measured against LLVM's ``O0``
402code quality).  For ``Om1``, the intention is to apply as few optimizations as
403possible and produce code as quickly as possible, accepting poor code quality.
404``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
405"sub-zero".
406
407High-level debuggability of generated code is so far not a design requirement.
408Subzero doesn't really do transformations that would obfuscate debugging; the
409main thing might be that register allocation (including stack slot coalescing
410for stack-allocated variables whose live ranges don't overlap) may render a
411variable's value unobtainable after its live range ends.  This would not be an
412issue for ``Om1`` since it doesn't register-allocate program-level variables,
413nor does it coalesce stack slots.  That said, fully supporting debuggability
414would require a few additions:
415
416- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
417  propagates global symbol names, local variable names, and function-internal
418  label names that are present in the ``.pexe`` file.  This would allow a
419  debugger to map addresses back to symbols in the ``.pexe`` file.
420
421- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
422  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
423  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
425
426- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
427  the toolchain would need to be modified to preserve it.
428
429Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
430produces code with one third the code quality.  ``Om1`` is good for testing and
431debugging -- during translation, it tends to expose errors in the basic lowering
432that might otherwise have been hidden by the register allocator or other
433optimization passes.  It also helps determine whether a code correctness problem
434is a fundamental problem in the basic lowering, or an error in another
435optimization pass.
436
437The implementation of target lowering also controls the recipe of passes used
438for ``Om1`` and ``O2`` translation.  For example, address mode inference may
439only be relevant for x86.
440
441Lowering strategy
442-----------------
443
444The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
445each high-level instruction down to a sequence of low-level target-specific
446instructions, in a largely context-free setting.  That is, each high-level
447instruction conceptually has a simple template expansion into low-level
448instructions, and lowering can in theory be done in any order.  This may sound
449like a small effort, but quite a large number of templates may be needed because
450of the number of PNaCl types and instruction variants.  Furthermore, there may
451be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
453similar to other template-based approaches in fast code generation or
454interpretation, though some decisions are deferred until after some global
455analysis passes, mostly related to register allocation, stack slot assignment,
456and specific choice of instruction variant and addressing mode.
457
458The key idea for a lowering template is to produce valid low-level instructions
459that are guaranteed to meet address mode and other structural requirements of
460the instruction set.  For example, on x86, the source operand of an integer
461store instruction must be an immediate or a physical register; a shift
462instruction's shift amount must be an immediate or in register ``%cl``; a
463function's integer return value is in ``%eax``; most x86 instructions are
464two-operand, in contrast to corresponding three-operand high-level instructions;
465etc.
466
467Because target lowering runs before register allocation, there is no way to know
468whether a given ``Ice::Variable`` operand lives on the stack or in a physical
469register.  When the low-level instruction calls for a physical register operand,
470the target lowering can create an infinite-weight Variable.  This tells the
471register allocator to assign infinite weight when making decisions, effectively
472guaranteeing some physical register.  Variables can also be pre-colored to a
473specific physical register (``cl`` in the shift example above), which also gives
474infinite weight.
475
476To illustrate, consider a high-level arithmetic instruction on 32-bit integer
477operands::
478
479    A = B + C
480
481X86 target lowering might produce the following::
482
483    T.inf = B  // mov instruction
484    T.inf += C // add instruction
485    A = T.inf  // mov instruction
486
487Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
488physical register, the three lowered instructions are all encodable regardless
489of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
490whether ``A`` is a physical register or in memory.
491
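
As a concrete illustration, a target-lowering helper for this pattern might look
like the following sketch.  The ``LoweringContext`` type and its
``makeInfiniteWeightTemp`` and ``emit`` methods are hypothetical stand-ins, not
the real Subzero lowering API::

    #include <cstdio>
    #include <string>

    // Minimal stand-in types; the real ICE operand classes are richer.
    struct Operand { std::string Name; };
    struct Variable : Operand {};

    // Hypothetical lowering context with just enough interface for the sketch.
    struct LoweringContext {
      int TempCounter = 0;
      Variable *makeInfiniteWeightTemp() {
        // In real Subzero the temporary is flagged so the register allocator
        // treats it as infinite weight; here we just create and name it.
        Variable *T = new Variable();
        T->Name = "T.inf" + std::to_string(TempCounter++);
        return T;
      }
      void emit(const char *Op, const Operand *Src, const Operand *Dst) {
        std::printf("  %s %s, %s\n", Op, Src->Name.c_str(), Dst->Name.c_str());
      }
    };

    // Lower the high-level "A = B + C" through an infinite-weight temporary so
    // the mov/add/mov sequence is encodable no matter where A, B, and C live.
    void lowerAdd32(LoweringContext &Ctx, Variable *A, Operand *B, Operand *C) {
      Variable *T = Ctx.makeInfiniteWeightTemp();
      Ctx.emit("mov", B, T);   // T.inf = B
      Ctx.emit("add", C, T);   // T.inf += C
      Ctx.emit("mov", T, A);   // A = T.inf
    }
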
492In this example, ``A`` must be a Variable and one may be tempted to simplify the
493lowering sequence by setting ``A`` as infinite-weight and using::
494
495        A = B  // mov instruction
496        A += C // add instruction
497
498This has two problems.  First, if the original instruction was actually ``A =
499B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
500register applies throughout ``A``'s entire live range.  This is probably not
501what is intended, and may ultimately lead to a failure to allocate a register
502for an infinite-weight variable.
503
504This style of lowering leads to many temporaries being generated, so in ``O2``
505mode, we rely on the register allocator to clean things up.  For example, in the
506example above, if ``B`` ends up getting a physical register and its live range
507ends at this instruction, the register allocator is likely to reuse that
508register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
509copy, which is removed as an emission-time peephole optimization.
510
511O2 lowering
512-----------
513
514Currently, the ``O2`` lowering recipe is the following:
515
516- Loop nest analysis
517
518- Local common subexpression elimination
519
520- Address mode inference
521
522- Read-modify-write (RMW) transformation
523
524- Basic liveness analysis
525
526- Load optimization
527
528- Target lowering
529
530- Full liveness analysis
531
532- Register allocation
533
534- Phi instruction lowering (advanced)
535
536- Post-phi lowering register allocation
537
538- Branch optimization
539
540These passes are described in more detail below.
541
542Om1 lowering
543------------
544
545Currently, the ``Om1`` lowering recipe is the following:
546
547- Phi instruction lowering (simple)
548
549- Target lowering
550
551- Register allocation (infinite-weight and pre-colored only)
552
553Optimization passes
554-------------------
555
556Liveness analysis
557^^^^^^^^^^^^^^^^^
558
559Liveness analysis is a standard dataflow optimization, implemented as follows.
560For each node (basic block), its live-out set is computed as the union of the
561live-in sets of its successor nodes.  Then the node's instructions are processed
562in reverse order, updating the live set, until the beginning of the node is
563reached, and the node's live-in set is recorded.  If this iteration has changed
564the node's live-in set, the node's predecessors are marked for reprocessing.
565This continues until no more nodes need reprocessing.  If nodes are processed in
566reverse topological order, the number of iterations over the CFG is generally
567equal to the maximum loop nest depth.
568
569To implement this, each node records its live-in and live-out sets, initialized
570to the empty set.  Each instruction records which of its Variables' live ranges
571end in that instruction, initialized to the empty set.  A side effect of
572liveness analysis is dead instruction elimination.  Each instruction can be
573marked as tentatively dead, and after the algorithm converges, the tentatively
574dead instructions are permanently deleted.
575
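
A minimal sketch of this fixed-point computation is shown below, using plain STL
sets rather than Subzero's compacted bit vectors, re-sweeping the whole CFG each
round instead of keeping a per-node worklist, and omitting the dead-instruction
bookkeeping::

    #include <set>
    #include <vector>

    using VarID = int;
    struct Inst {
      std::vector<VarID> Srcs;   // variables read
      std::vector<VarID> Dests;  // variables written (at most one in real ICE)
    };
    struct Node {
      std::vector<Inst> Insts;
      std::vector<Node *> Succs;
      std::set<VarID> LiveIn, LiveOut;
    };

    void computeLiveness(std::vector<Node *> &Nodes) {  // Nodes: topological order
      bool Changed = true;
      while (Changed) {
        Changed = false;
        // Visiting nodes in reverse topological order converges in roughly
        // "maximum loop nest depth" rounds.
        for (auto It = Nodes.rbegin(); It != Nodes.rend(); ++It) {
          Node *N = *It;
          std::set<VarID> Live;
          for (Node *S : N->Succs)  // live-out = union of successors' live-in
            Live.insert(S->LiveIn.begin(), S->LiveIn.end());
          N->LiveOut = Live;
          for (auto I = N->Insts.rbegin(); I != N->Insts.rend(); ++I) {
            for (VarID D : I->Dests)
              Live.erase(D);        // a definition ends liveness (going backward)
            for (VarID S : I->Srcs)
              Live.insert(S);       // a use makes the variable live
          }
          if (Live != N->LiveIn) {  // changed: predecessors must be revisited
            N->LiveIn = Live;
            Changed = true;
          }
        }
      }
    }
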
576Optionally, after this liveness analysis completes, we can do live range
577construction, in which we calculate the live range of each variable in terms of
578instruction numbers.  A live range is represented as a union of segments, where
579the segment endpoints are instruction numbers.  Instruction numbers are required
580to be unique across the CFG, and monotonically increasing within a basic block.
581As a union of segments, live ranges can contain "gaps" and are therefore
582precise.  Because of SSA properties, a variable's live range can start at most
583once in a basic block, and can end at most once in a basic block.  Liveness
584analysis keeps track of which variable/instruction tuples begin live ranges and
585end live ranges, and combined with live-in and live-out sets, we can efficiently
586build up live ranges of all variables across all basic blocks.
587
588A lot of care is taken to try to make liveness analysis fast and efficient.
589Because of the lowering strategy, the number of variables is generally
590proportional to the number of instructions, leading to an O(N^2) complexity
591algorithm if implemented naively.  To improve things based on sparsity, we note
592that most variables are "local" and referenced in at most one basic block (in
593contrast to the "global" variables with multi-block usage), and therefore cannot
594be live across basic blocks.  Therefore, the live-in and live-out sets,
595typically represented as bit vectors, can be limited to the set of global
596variables, and the intra-block liveness bit vector can be compacted to hold the
597global variables plus the local variables for that block.
598
599Register allocation
600^^^^^^^^^^^^^^^^^^^
601
602Subzero implements a simple linear-scan register allocator, based on the
603allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
604Register Allocation in the Context of SSA Form and Register Constraints
605<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
606several nice features:
607
608- Live ranges are represented as unions of segments, as described above, rather
609  than a single start/end tuple.
610
611- It allows pre-coloring of variables with specific physical registers.
612
613- It applies equally well to pre-lowered Phi instructions.
614
615The paper suggests an approach of aggressively coalescing variables across Phi
616instructions (i.e., trying to force Phi source and destination variables to have
617the same register assignment), but we reject that in favor of the more natural
618preference mechanism described below.
619
620We enhance the algorithm in the paper with the capability of automatic inference
621of register preference, and with the capability of allowing overlapping live
622ranges to safely share the same register in certain circumstances.  If we are
623considering register allocation for variable ``A``, and ``A`` has a single
624defining instruction ``A=B+C``, then the preferred register for ``A``, if
625available, would be the register assigned to ``B`` or ``C``, if any, provided
626that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
627way we infer a good register preference for ``A``.
628
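
In code, the preference heuristic amounts to something like the following
sketch, with live ranges reduced to simple segment lists; the real
implementation works on Subzero's live range representation and register masks::

    #include <utility>
    #include <vector>

    // A live range is a union of [begin, end) instruction-number segments.
    using LiveRange = std::vector<std::pair<int, int>>;

    static bool overlaps(const LiveRange &A, const LiveRange &B) {
      for (const auto &SA : A)
        for (const auto &SB : B)
          if (SA.first < SB.second && SB.first < SA.second)
            return true;
      return false;
    }

    struct Var {
      int AssignedReg = -1;  // -1 means no register assigned yet
      LiveRange Range;
    };

    // Hypothetical sketch of preference inference: given A's single defining
    // instruction A = op(Srcs...), prefer a register already assigned to one of
    // the sources, provided that source's live range does not overlap A's.
    static int inferPreferredReg(const Var &A, const std::vector<const Var *> &Srcs) {
      for (const Var *S : Srcs)
        if (S->AssignedReg >= 0 && !overlaps(S->Range, A.Range))
          return S->AssignedReg;
      return -1;  // no preference; fall back to normal register selection
    }
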
629We allow overlapping live ranges to get the same register in certain cases.
630Suppose a high-level instruction like::
631
632    A = unary_op(B)
633
634has been target-lowered like::
635
636    T.inf = B
637    A = unary_op(T.inf)
638
639Further, assume that ``B``'s live range continues beyond this instruction
640sequence, and that ``B`` has already been assigned some register.  Normally, we
641might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
642turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
643different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
644guaranteed to be in a register, so in theory these overlapping live ranges could
645safely have the same register.  Our implementation allows this overlap as long
646as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
647modified within ``T.inf``'s live range.
648
649Subzero's register allocator can be run in 3 configurations.
650
651- Normal mode.  All Variables are considered for register allocation.  It
652  requires full liveness analysis and live range construction as a prerequisite.
653  This is used by ``O2`` lowering.
654
655- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
656  All other Variables are stack-allocated.  It does not require liveness
657  analysis; instead, it quickly scans the instructions and records first
658  definitions and last uses of all relevant Variables, using that to construct a
659  single-segment live range.  Although this includes most of the Variables, the
660  live ranges are mostly simple, short, and rarely overlapping, which the
661  register allocator handles efficiently.  This is used by ``Om1`` lowering.
662
663- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
664  register allocation, and may result in new infinite-weight Variables that need
665  registers.  One would like to just run something like minimal mode to assign
666  registers to the new Variables while respecting existing register allocation
667  decisions.  However, it sometimes happens that there are no free registers.
668  In this case, some register needs to be forcibly spilled to the stack and
669  temporarily reassigned to the new Variable, and reloaded at the end of the new
670  Variable's live range.  The register must be one that has no explicit
671  references during the Variable's live range.  Since Subzero currently doesn't
672  track def/use chains (though it does record the CfgNode where a Variable is
673  defined), we just do a brute-force search across the CfgNode's instruction
674  list for the instruction numbers of interest.  This situation happens very
675  rarely, so there's little point for now in improving its performance.
676
677The basic linear-scan algorithm may, as it proceeds, rescind an early register
678allocation decision, leaving that Variable to be stack-allocated.  Some of these
679times, it turns out that the Variable could have been given a different register
680without conflict, but by this time it's too late.  The literature recognizes
681this situation and describes "second-chance bin-packing", which Subzero can do.
682We can rerun the register allocator in a mode that respects existing register
683allocation decisions, and sometimes it finds new non-conflicting opportunities.
684In fact, we can repeatedly run the register allocator until convergence.
685Unfortunately, in the current implementation, these subsequent register
686allocation passes end up being extremely expensive.  This is because of the
687treatment of the "unhandled pre-colored" Variable set, which is normally very
688small but ends up being quite large on subsequent passes.  Its performance can
689probably be made acceptable with a better choice of data structures, but for now
690this second-chance mechanism is disabled.
691
692Future work is to implement LLVM's `Greedy
693<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
694register allocator as a replacement for the basic linear-scan algorithm, given
695LLVM's experience with its improvement in code quality.  (The blog post claims
696that the Greedy allocator also improved maintainability because a lot of hacks
697could be removed, but Subzero is probably not yet to that level of hacks, and is
698less likely to see that particular benefit.)
699
700Local common subexpression elimination
701^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
702
703The Local CSE implementation goes through each instruction and records a portion
704of each ``Seen`` instruction in a hashset-like container.  That portion consists
705of the entire instruction except for any dest variable. That means ``A = X + Y``
706and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows
707us to detect common subexpressions.
708
709Whenever a repetition is detected, the redundant variables are stored in a
710container mapping the replacee to the replacement. In the case above, it would
711be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.
712
713At any point if a variable that has an entry in the replacement table is
714encountered, it is replaced with the variable it is mapped to. This ensures that
715the redundant variables will not have any uses in the basic block, allowing
716dead code elimination to clean up the redundant instruction.
717
With SSA, the information stored is never invalidated.  However, non-SSA input
is supported with the ``-lcse=no-ssa`` option, which maintains some extra
dependency information to ensure proper invalidation on variable assignment.
This mode is not rigorously tested, because the pass is run at an early stage
where it is safe to assume variables have a single definition, and it is not
enabled by default because it bumps the compile-time overhead from 2% to 6%.
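
A minimal sketch of the SSA-only flavor of this pass is shown below, with
instruction signatures reduced to strings; the real implementation hashes ICE
instructions directly::

    #include <map>
    #include <string>
    #include <vector>

    // Minimal stand-in for an instruction: a destination, an opcode, and
    // source variable names.
    struct SimpleInst {
      std::string Dest;
      std::string Op;
      std::vector<std::string> Srcs;
    };

    // One basic block of local CSE, assuming SSA input (so nothing that has
    // been recorded ever needs to be invalidated).
    void localCSE(std::vector<SimpleInst> &Block) {
      std::map<std::string, std::string> Seen;     // "op src..." -> first dest
      std::map<std::string, std::string> Replace;  // redundant var -> replacement
      for (SimpleInst &I : Block) {
        // First rewrite any use of a variable already known to be redundant.
        for (std::string &S : I.Srcs) {
          auto R = Replace.find(S);
          if (R != Replace.end())
            S = R->second;
        }
        // Hash everything except the destination.
        std::string Key = I.Op;
        for (const std::string &S : I.Srcs)
          Key += " " + S;
        auto It = Seen.find(Key);
        if (It == Seen.end())
          Seen.emplace(Key, I.Dest);     // first occurrence of this computation
        else
          Replace[I.Dest] = It->second;  // I.Dest is redundant; its uses get
                                         // rewritten and DCE deletes I later
      }
    }
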
724
725Loop-invariant code motion
726^^^^^^^^^^^^^^^^^^^^^^^^^^
727
728This pass utilizes the loop analysis information to hoist invariant instructions
729to loop pre-headers. A loop must have a single entry node (header) and that node
730must have a single external predecessor for this optimization to work, as it is
731currently implemented.
732
The pass works by iterating over all instructions in the loop until the set of
invariant instructions converges.  In each iteration, any instruction not yet
known to be invariant, whose operands are all constants or variables already
known to be invariant, is added to the result set.  The destination variable of
that instruction is then added to the set of variables known to be invariant
(which is initialized with the constant arguments).
738
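
A sketch of the convergence loop, with instructions and variables reduced to
plain names, could look like this; ``InvariantVars`` is seeded by the caller
with the constant arguments as described above::

    #include <set>
    #include <string>
    #include <vector>

    struct LoopInst {
      std::string Dest;
      std::vector<std::string> Srcs;   // operand names (constants included)
      bool HasSideEffects = false;     // loads, stores, calls, etc. are skipped
    };

    std::vector<const LoopInst *>
    findInvariant(const std::vector<LoopInst> &Loop,
                  std::set<std::string> InvariantVars) {  // seeded with constants
      std::vector<const LoopInst *> Result;
      std::set<const LoopInst *> InResult;
      bool Changed = true;
      while (Changed) {                // iterate until the result set converges
        Changed = false;
        for (const LoopInst &I : Loop) {
          if (I.HasSideEffects || InResult.count(&I))
            continue;
          bool AllInvariant = true;
          for (const std::string &S : I.Srcs)
            if (!InvariantVars.count(S)) {
              AllInvariant = false;
              break;
            }
          if (AllInvariant) {
            Result.push_back(&I);          // candidate for hoisting to pre-header
            InResult.insert(&I);
            InvariantVars.insert(I.Dest);  // its destination is now invariant too
            Changed = true;
          }
        }
      }
      return Result;
    }
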
739Improving the loop-analysis infrastructure is likely to have significant impact
740on this optimization. Inserting an extra node to act as the pre-header when the
741header has multiple incoming edges from outside could also be a good idea.
742Expanding the initial invariant variable set to contain all variables that do
743not have definitions inside the loop does not seem to improve anything.
744
745Short circuit evaluation
746^^^^^^^^^^^^^^^^^^^^^^^^
747
748Short circuit evaluation splits nodes and introduces early jumps when the result
749of a logical operation can be determined early and there are no observable side
effects of skipping the rest of the computation.  The instructions are
considered backwards from the end of the basic block.  When a definition of a
variable
752involved in a conditional jump is found, an extra jump can be inserted in that
753location, moving the rest of the instructions in the node to a newly inserted
754node. Consider this example::
755
756  __N :
757    a = <something>
758    Instruction 1 without side effect
759    ... b = <something> ...
760    Instruction N without side effect
761    t1 = or a b
762    br t1 __X __Y
763
764is transformed to::
765
766  __N :
767    a = <something>
768    br a __X __N_ext
769
770  __N_ext :
771    Instruction 1 without side effect
772    ... b = <something> ...
773    Instruction N without side effect
774    br b __X __Y
775
The logic for AND is analogous; the only difference is that the early jump is
taken on a ``false`` value instead of ``true``.
778
779Global Variable Splitting
780^^^^^^^^^^^^^^^^^^^^^^^^^
781
782Global variable splitting (``-split-global-vars``) is run after register
783allocation. It works on the variables that did not manage to get registers (but
784are allowed to) and decomposes their live ranges into the individual segments
785(which span a single node at most). New variables are created (but not yet used)
with these smaller live ranges, and the register allocator is run again.  This
rerun is not expensive, because the old variables that already had registers are
now considered pre-colored.
789
The new variables that get registers replace their parent variables over the
corresponding portion of the parent's live range.  A copy from the old variable
to the new one is introduced before the first use, and the reverse copy after
the last definition in that segment.
794
795Basic phi lowering
796^^^^^^^^^^^^^^^^^^
797
798The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
799implements it).  Consider this example::
800
801    L1:
802      ...
803      br L3
804    L2:
805      ...
806      br L3
807    L3:
808      A = phi [B, L1], [C, L2]
809      X = phi [Y, L1], [Z, L2]
810
811For each destination of a phi instruction, we can create a temporary and insert
812the temporary's assignment at the end of the predecessor block::
813
814    L1:
815      ...
816      A' = B
817      X' = Y
818      br L3
819    L2:
820      ...
821      A' = C
822      X' = Z
823      br L3
    L3:
825      A = A'
826      X = X'
827
828This transformation is very simple and reliable.  It can be done before target
829lowering and register allocation, and it easily avoids the classic lost-copy and
830related problems.  ``Om1`` lowering uses this strategy.
831
832However, it has the disadvantage of initializing temporaries even for branches
833not taken, though that could be mitigated by splitting non-critical edges and
834putting assignments in the edge-split nodes.  Another problem is that without
835extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
836specific ordering even though phi semantics are that the assignments are
837parallel or unordered.  This sometimes imposes false live range overlaps and
838leads to poorer register allocation.
839
840Advanced phi lowering
841^^^^^^^^^^^^^^^^^^^^^
842
843``O2`` lowering defers phi lowering until after register allocation to avoid the
844problem of false live range overlaps.  It works as follows.  We split each
845incoming edge and move the (parallel) phi assignments into the split nodes.  We
846linearize each set of assignments by finding a safe, topological ordering of the
847assignments, respecting register assignments as well.  For example::
848
849    A = B
850    X = Y
851
852Normally these assignments could be executed in either order, but if ``B`` and
853``X`` are assigned the same physical register, we would want to use the above
854ordering.  Dependency cycles are broken by introducing a temporary.  For
855example::
856
857    A = B
858    B = A
859
860Here, a temporary breaks the cycle::
861
862    t = A
863    A = B
864    B = t
865
866Finally, we use the existing target lowering to lower the assignments in this
867basic block, and once that is done for all basic blocks, we run the post-phi
868variant of register allocation on the edge-split basic blocks.
869
870When computing a topological order, we try to first schedule assignments whose
871source has a physical register, and last schedule assignments whose destination
872has a physical register.  This helps reduce register pressure.
873
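
A sketch of the linearization step (ignoring the register-aware scheduling
heuristics described above) is shown below; it emits any assignment whose
destination is no longer needed as a source, and introduces a temporary to break
a cycle when no such assignment exists::

    #include <string>
    #include <vector>

    struct Assign { std::string Dest, Src; };

    std::vector<Assign> sequentialize(std::vector<Assign> Pending) {
      std::vector<Assign> Out;
      int TempNum = 0;
      while (!Pending.empty()) {
        bool EmittedOne = false;
        for (size_t i = 0; i < Pending.size(); ++i) {
          bool DestStillNeeded = false;
          for (size_t j = 0; j < Pending.size(); ++j)
            if (j != i && Pending[j].Src == Pending[i].Dest)
              DestStillNeeded = true;   // someone still reads the old value
          if (!DestStillNeeded) {
            Out.push_back(Pending[i]);  // safe to overwrite this destination now
            Pending.erase(Pending.begin() + i);
            EmittedOne = true;
            break;
          }
        }
        if (!EmittedOne) {
          // Dependency cycle: copy one source into a temporary and redirect the
          // remaining readers of that source to the temporary.
          std::string OldSrc = Pending[0].Src;
          std::string Temp = "t" + std::to_string(TempNum++);
          Out.push_back({Temp, OldSrc});
          for (Assign &A : Pending)
            if (A.Src == OldSrc)
              A.Src = Temp;
        }
      }
      return Out;
    }

For the ``A = B; B = A`` cycle above, this sketch produces ``t0 = B; B = A;
A = t0``, which is equivalent to the temporary-based resolution shown earlier.
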
874X86 address mode inference
875^^^^^^^^^^^^^^^^^^^^^^^^^^
876
877We try to take advantage of the x86 addressing mode that includes a base
878register, an index register, an index register scale amount, and an immediate
879offset.  We do this through simple pattern matching.  Starting with a load or
880store instruction where the address is a variable, we initialize the base
881register to that variable, and look up the instruction where that variable is
882defined.  If that is an add instruction of two variables and the index register
883hasn't been set, we replace the base and index register with those two
884variables.  If instead it is an add instruction of a variable and a constant, we
885replace the base register with the variable and add the constant to the
886immediate offset.
887
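
The core of the matching loop can be sketched as follows; only the two patterns
just described are shown, and the various legality checks the real pass performs
are omitted::

    #include <cstdint>

    struct Var;
    struct DefInst {            // the instruction defining a variable
      enum { AddVarVar, AddVarConst, Other } Kind;
      Var *Src0;                // variable operand(s)
      Var *Src1;
      int32_t Imm;              // constant operand for AddVarConst
    };
    struct Var { DefInst *Def; };

    struct X86Address {         // base + index*scale + offset
      Var *Base = nullptr;
      Var *Index = nullptr;
      int Scale = 1;
      int32_t Offset = 0;
    };

    // Keep folding the base variable's defining instruction into the address
    // until nothing more matches.
    X86Address inferAddress(Var *AddrVar) {
      X86Address A;
      A.Base = AddrVar;
      bool Matched = true;
      while (Matched && A.Base && A.Base->Def) {
        Matched = false;
        DefInst *D = A.Base->Def;
        if (D->Kind == DefInst::AddVarVar && A.Index == nullptr) {
          A.Base = D->Src0;     // base+index <- the two added variables
          A.Index = D->Src1;
          Matched = true;
        } else if (D->Kind == DefInst::AddVarConst) {
          A.Base = D->Src0;     // fold the constant into the immediate offset
          A.Offset += D->Imm;
          Matched = true;
        }
      }
      return A;
    }
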
888There are several more patterns that can be matched.  This pattern matching
889continues on the load or store instruction until no more matches are found.
890Because a program typically has few load and store instructions (not to be
891confused with instructions that manipulate stack variables), this address mode
892inference pass is fast.
893
894X86 read-modify-write inference
895^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
896
897A reasonably common bitcode pattern is a non-atomic update of a memory
898location::
899
900    x = load addr
901    y = add x, 1
902    store y, addr
903
904On x86, with good register allocation, the Subzero passes described above
905generate code with only this quality::
906
907    mov [%ebx], %eax
908    add $1, %eax
909    mov %eax, [%ebx]
910
911However, x86 allows for this kind of code::
912
913    add $1, [%ebx]
914
915which requires fewer instructions, but perhaps more importantly, requires fewer
916physical registers.
917
918It's also important to note that this transformation only makes sense if the
919store instruction ends ``x``'s live range.
920
921Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
922opportunities via simple pattern matching.  The only problem is that it is run
923before liveness analysis, which is needed to determine whether ``x``'s live
924range ends after the RMW.  Since liveness analysis is one of the most expensive
925passes, it's not attractive to run it an extra time just for RMW analysis.
Instead, we essentially generate both the RMW and the non-RMW versions, and then
during lowering, the RMW version deletes itself if it finds that ``x`` is still
live.
928
929X86 compare-branch inference
930^^^^^^^^^^^^^^^^^^^^^^^^^^^^
931
932In the LLVM instruction set, the compare/branch pattern works like this::
933
934    cond = icmp eq a, b
935    br cond, target
936
937The result of the icmp instruction is a single bit, and a conditional branch
938tests that bit.  By contrast, most target architectures use this pattern::
939
940    cmp a, b  // implicitly sets various bits of FLAGS register
941    br eq, target  // branch on a particular FLAGS bit
942
943A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
944``cond`` and conditionally branches.  Subzero has a pass that identifies
945boolean-based operations like this and folds them into a single
946compare/branch-like operation.  It is set up for more than just cmp/br though.
947Boolean producers include icmp (integer compare), fcmp (floating-point compare),
948and trunc (integer truncation when the destination has bool type).  Boolean
949consumers include branch, select (the ternary operator from the C language), and
950sign-extend and zero-extend when the source has bool type.
951
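
A sketch of the producer/consumer matching for the simplest case (an ``icmp``
immediately followed by the branch that consumes it) is shown below; the real
pass handles the other producers and consumers listed above and does not require
adjacency::

    #include <vector>

    struct BoolInst {
      enum Kind { Icmp, Br, Other };
      Kind K = Other;
      int Dest = -1;      // boolean result variable (producers)
      int CondVar = -1;   // boolean condition variable (Br)
      bool FoldedPair = false;
    };

    void foldCompareBranch(std::vector<BoolInst> &Block) {
      for (size_t i = 1; i < Block.size(); ++i) {
        BoolInst &Producer = Block[i - 1];
        BoolInst &Consumer = Block[i];
        if (Producer.K == BoolInst::Icmp && Consumer.K == BoolInst::Br &&
            Consumer.CondVar == Producer.Dest) {
          // Lower the pair as cmp + branch-on-FLAGS instead of materializing a
          // 0/1 value and testing it.  (A single-use check is omitted here.)
          Producer.FoldedPair = true;
          Consumer.FoldedPair = true;
        }
      }
    }
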
952Code emission
953-------------
954
955Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
957that iterates through the low-level ICE instructions and invokes the relevant
958assembler functions.  Placeholders are added for later fixup of branch target
959offsets.  (Backward branches use short offsets if possible; forward branches
960generally use long offsets unless it is an intra-block branch of "known" short
961length.)  The assembler emits into a staging buffer.  Once emission into the
962staging buffer for a function is complete, the data is emitted to the output
963file as an ELF object file, and metadata such as relocations, symbol table, and
964string table, are accumulated for emission at the end.  Global data initializers
are emitted similarly.  A key point is that once this is done, the staging buffer
can be deallocated, and only a minimum of data needs to be held until the end.
967
968As a debugging alternative, Subzero can emit textual assembly code which can
969then be run through an external assembler.  This is of course super slow, but
970quite valuable when bringing up a new target.
971
972As another debugging option, the staging buffer can be emitted as textual
973assembly, primarily in the form of ".byte" lines.  This allows the assembler to
974be tested separately from the ELF related code.
975
976Memory management
977-----------------
978
979Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
980LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
981allocated from the arena are never actually freed; instead, when the CFG
982translation completes and the CFG is deleted, the entire arena memory is
983reclaimed at once.  This style of allocation works well in an environment like a
984compiler where there are distinct phases with only easily-identifiable objects
985living across phases.  It frees the developer from having to manage object
986deletion, and it amortizes deletion costs across just a single arena deletion at
987the end of the phase.  Furthermore, it helps scalability by allocating entirely
988from thread-local memory pools, and minimizing global locking of the heap.
989
990Instructions are probably the most heavily allocated complex class in Subzero.
991We represent an instruction list as an intrusive doubly linked list, allocate
992all instructions from the ``CfgLocalAllocator``, and we make sure each
993instruction subclass is basically `POD
994<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
995trivial destructor.  This way, when the CFG is finished, we don't need to
996individually deallocate every instruction.  We do similar for Variables, which
997is probably the second most popular complex class.
998
999There are some situations where passes need to use some `STL container class
1000<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
1001``CfgLocalAllocator`` as the container allocator if this is needed.
1002
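
As an illustration of the idea (not Subzero's actual classes), here is a toy
bump allocator with an STL-compatible adapter, analogous in spirit to backing a
standard container with the ``CfgLocalAllocator``::

    #include <cstddef>
    #include <vector>

    // Toy arena: memory is handed out bump-pointer style and only reclaimed
    // when the arena itself is destroyed (alignment and oversized requests are
    // handled simplistically).
    class Arena {
      std::vector<char *> Chunks;
      size_t Used = 0;
      static constexpr size_t ChunkSize = 1 << 20;
    public:
      void *allocate(size_t N) {
        Used = (Used + 15) & ~size_t(15);  // crude 16-byte alignment
        if (Chunks.empty() || Used + N > ChunkSize) {
          Chunks.push_back(new char[ChunkSize]);
          Used = 0;
        }
        void *P = Chunks.back() + Used;
        Used += N;
        return P;
      }
      ~Arena() {
        for (char *C : Chunks)
          delete[] C;  // the entire arena is reclaimed at once
      }
    };

    // Minimal STL-compatible allocator adapter over the arena.
    template <class T> struct ArenaAllocator {
      using value_type = T;
      Arena *A;
      explicit ArenaAllocator(Arena *TheArena) : A(TheArena) {}
      template <class U> ArenaAllocator(const ArenaAllocator<U> &O) : A(O.A) {}
      T *allocate(size_t N) { return static_cast<T *>(A->allocate(N * sizeof(T))); }
      void deallocate(T *, size_t) {}  // no-op; the arena frees everything later
      template <class U> bool operator==(const ArenaAllocator<U> &O) const { return A == O.A; }
      template <class U> bool operator!=(const ArenaAllocator<U> &O) const { return A != O.A; }
    };

    // Usage: a vector whose storage lives and dies with the arena (the CFG).
    // Arena PerFunctionArena;
    // std::vector<int, ArenaAllocator<int>> V{ArenaAllocator<int>(&PerFunctionArena)};
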
1003Multithreaded translation
1004-------------------------
1005
1006Subzero is designed to be able to translate functions in parallel.  With the
1007``-threads=N`` command-line option, there is a 3-stage producer-consumer
1008pipeline:
1009
1010- A single thread parses the ``.pexe`` file and produces a sequence of work
1011  units.  A work unit can be either a fully constructed CFG, or a set of global
1012  initializers.  The work unit includes its sequence number denoting its parse
1013  order.  Each work unit is added to the translation queue.
1014
1015- There are N translation threads that draw work units from the translation
1016  queue and lower them into assembler buffers.  Each assembler buffer is added
1017  to the emitter queue, tagged with its sequence number.  The CFG and its
1018  ``CfgLocalAllocator`` are disposed of at this point.
1019
1020- A single thread draws assembler buffers from the emitter queue and appends to
1021  the output file.  It uses the sequence numbers to reintegrate the assembler
1022  buffers according to the original parse order, such that output order is
1023  always deterministic.
1024
1025This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
1026for a total of ``N+2`` execution threads, taking the parser and emitter threads
1027into account.  For the special case of ``N=0``, execution is entirely sequential
1028-- the same thread parses, translates, and emits, one function at a time.  This
1029is useful for performance measurements.
1030
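
One plausible way to implement the emitter's reintegration step is sketched
below: buffers arriving out of order are parked in a map keyed by sequence
number, and the longest in-order prefix is flushed whenever a new buffer
arrives (locking omitted)::

    #include <cstdint>
    #include <map>
    #include <string>

    class OrderedEmitter {
      uint32_t NextToEmit = 0;
      std::map<uint32_t, std::string> Pending;  // sequence number -> buffer
    public:
      // Called whenever a translation thread finishes a work unit.
      void add(uint32_t SeqNumber, std::string Buffer) {
        Pending.emplace(SeqNumber, std::move(Buffer));
        // Flush everything that is now contiguous with what was already
        // emitted, so output order always matches parse order.
        while (!Pending.empty() && Pending.begin()->first == NextToEmit) {
          writeToOutputFile(Pending.begin()->second);
          Pending.erase(Pending.begin());
          ++NextToEmit;
        }
      }
    private:
      void writeToOutputFile(const std::string &Buffer) {
        (void)Buffer;  // append the assembled bytes to the ELF output here
      }
    };
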
1031Ideally, we would like to get near-linear scalability as the number of
1032translation threads increases.  We expect that ``-threads=1`` should be slightly
1033faster than ``-threads=0`` as the small amount of time spent parsing and
emitting is done largely in parallel with translation.  With perfect
scalability, we would see ``-threads=N`` translating ``N`` times as fast as
1036``-threads=1``, up until the point where parsing or emitting becomes the
1037bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
1038performance would become a bottleneck and efficiency might peak at, say, 75%.
1039
Currently, parsing takes about 11% of total sequential time.  If translation
ever becomes fast and scalable enough that parsing becomes the bottleneck, it
should be possible to make parsing multithreaded as well.
1043
1044Internally, all shared, mutable data is held in the GlobalContext object, and
1045access to each field is guarded by a mutex.
1046
1047Security
1048--------
1049
1050Subzero includes a number of security features in the generated code, as well as
1051in the Subzero translator itself, which run on top of the existing Native Client
1052sandbox as well as Chrome's OS-level sandbox.
1053
1054Sandboxed translator
1055^^^^^^^^^^^^^^^^^^^^
1056
1057When running inside the browser, the Subzero translator executes as sandboxed,
1058untrusted code that is initially checked by the validator, just like the
1059LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
1060more or less secure than the translator it replaces, from the point of view of
1061the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
1062was designed from the start with security in mind, so one expects fewer attacker
1063opportunities here.
1064
1065Fuzzing
1066^^^^^^^
1067
1068We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
1069combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
1070<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
1071find and fix cases where Subzero crashes or otherwise ungracefully fails on
1072unexpected inputs, and to do so automatically over a large range of unexpected
1073inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
1074of an attacker exploiting these bugs.
1075
1076Most of the problems found so far are ones most appropriately handled in the
1077parser.  However, there have been a couple that have identified problems in the
1078lowering, or otherwise inappropriately triggered assertion failures and fatal
1079errors.  We continue to dig into this area.
1080
1081Future security work
1082^^^^^^^^^^^^^^^^^^^^
1083
1084Subzero is well-positioned to explore other future security enhancements, e.g.:
1085
1086- Tightening the Native Client sandbox.  ABI changes, such as the previous work
1087  on `hiding the sandbox base address
1088  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
1089  in x86-64, are easy to experiment with in Subzero.
1090
1091- Making the executable code section read-only.  This would prevent a PNaCl
1092  application from inspecting its own binary and trying to find ROP gadgets even
1093  after code diversification has been performed.  It may still be susceptible to
  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
  security is still improved overall.
1096
1097- Instruction selection diversification.  It may be possible to lower a given
1098  instruction in several largely equivalent ways, which gives more opportunities
1099  for code randomization.
1100
1101Chrome integration
1102------------------
1103
1104Currently Subzero is available in Chrome for the x86-32 architecture, but under
1105a flag.  When the flag is enabled, Subzero is used when the `manifest file
1106<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
1107linking to the ``.pexe`` file specifies the ``O0`` optimization level.
1108
1109The next step is to remove the flag, i.e. invoke Subzero as the only translator
1110for ``O0``-specified manifest files.
1111
1112Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
1113case Subzero could be used for all PNaCl translation.

Command line options
--------------------

Subzero has a number of command-line options for debugging and diagnostics.
Among the more interesting are the following.

- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
  diagnostic output, with various levels of detail after each pass.  Instruction
  numbers can be printed or suppressed.  Deleted instructions can be printed or
  suppressed (they are retained in the instruction list, as discussed earlier,
  because they can help explain how lower-level instructions originated).
  Liveness information can be printed when available.  Details of register
  allocation can be printed as register allocator decisions are made.  And more.

- Running Subzero with any level of verbosity produces an enormous amount of
  output.  When debugging a single function, the ``-verbose-focus`` flag
  suppresses verbose output except for the specified function.

- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
  at exit.  Timing markers can be placed in the Subzero source code to demarcate
  logical operations or passes of interest.  Basic timing information plus
  call-stack-style timing information is printed at the end.  (A sketch of such
  a timing marker appears after this list.)

- In addition to the pass-level breakdown, the user can instead get a report on
  the overall translation time for each function, to help focus on timing
  outliers.  Also, ``-timing-focus`` limits the ``-timing`` reporting to a
  single function, instead of aggregating pass timing across all functions.

- The ``-szstats`` option reports various statistics on each function, such as
  stack frame size, static instruction count, etc.  It may be helpful to track
  these stats over time as Subzero is improved, as an approximate measure of
  code quality.

- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
  output, annotates the assembly output with register-focused liveness
  information.  In particular, each basic block is annotated with which
  registers are live-in and live-out, and each instruction is annotated with
  which registers' and stack locations' live ranges end at that instruction.
  This is particularly useful when studying the generated code to find
  opportunities for code quality improvements.
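
The scoped timing markers referred to above could be as simple as the following
sketch; the class name and direct printing are illustrative, and Subzero's
actual timer infrastructure aggregates results into the ``-timing`` report
instead.

.. code-block:: c++

  #include <chrono>
  #include <cstdio>

  // Hypothetical RAII timing marker: it records the time from construction to
  // destruction, i.e. the duration of the enclosing scope.
  class ScopedTimer {
  public:
    explicit ScopedTimer(const char *Label)
        : Label(Label), Start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
      auto End = std::chrono::steady_clock::now();
      auto Usec =
          std::chrono::duration_cast<std::chrono::microseconds>(End - Start)
              .count();
      std::printf("%s: %lld us\n", Label, static_cast<long long>(Usec));
    }

  private:
    const char *Label;
    std::chrono::steady_clock::time_point Start;
  };

  void runLivenessPass() {
    ScopedTimer T("liveness"); // Times everything until the end of this scope.
    // ... pass body ...
  }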

Testing and debugging
---------------------

LLVM lit tests
^^^^^^^^^^^^^^

For basic testing, Subzero uses LLVM's `lit
<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
have a suite of hundreds of small functions where we test for particular
assembly code patterns across different target architectures.

Cross tests
^^^^^^^^^^^

Unfortunately, the lit tests don't do a great job of precisely testing the
correctness of the output.  Much better are the cross tests, which are execution
tests that compare the behavior of Subzero-translated and
``pnacl-llc``-translated bitcode across a wide variety of interesting inputs.
Each cross test consists of a set of C, C++, and/or low-level bitcode files.
The C and C++ source files are compiled down to bitcode.  The bitcode files are
translated by ``pnacl-llc`` and also by Subzero.  Subzero mangles global symbol
names with a special prefix to avoid duplicate symbol errors.  A driver program
invokes both versions on a large set of interesting inputs, and reports when the
Subzero and ``pnacl-llc`` results differ.  Cross tests turn out to be an
excellent way of testing the basic lowering patterns, but they are less useful
for testing more global things like liveness analysis and register allocation.
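
Conceptually, the driver's comparison loop looks like the following sketch.
The ``Subzero_`` prefix, the ``testShift`` function, and the input sweep are
purely illustrative; the real cross-test harness is generated from the test
sources.

.. code-block:: c++

  #include <cstdint>
  #include <cstdio>

  // One version of the function under test is linked from the
  // pnacl-llc-translated object, the other from the Subzero-translated object
  // whose global symbols carry a distinguishing prefix (shown here as
  // "Subzero_" for illustration).
  extern "C" int32_t testShift(int32_t Value, int32_t Amount);
  extern "C" int32_t Subzero_testShift(int32_t Value, int32_t Amount);

  int main() {
    unsigned Failures = 0;
    const int32_t Inputs[] = {0, 1, -1, 31, 32, 0x7fffffff, INT32_MIN};
    // Sweep interesting inputs and report any mismatch between the two
    // translations of the same bitcode.
    for (int32_t Value : Inputs) {
      for (int32_t Amount : Inputs) {
        int32_t Expected = testShift(Value, Amount);
        int32_t Actual = Subzero_testShift(Value, Amount);
        if (Expected != Actual) {
          std::printf("mismatch: testShift(%d, %d): %d vs %d\n", Value, Amount,
                      Expected, Actual);
          ++Failures;
        }
      }
    }
    return Failures ? 1 : 0;
  }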

Bisection debugging
^^^^^^^^^^^^^^^^^^^

Sometimes with a new application, Subzero will end up producing incorrect code
that either crashes at runtime or otherwise produces the wrong results.  When
this happens, we need to narrow it down to a single function (or small set of
functions) that yields incorrect behavior.  For this, we have a bisection
debugging framework.  Here, we initially translate the entire application once
with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
selectively weaken symbols based on a list provided on the command line.  The
two object files can then be linked together without link errors, with the
desired version of each method "winning".  Then the binary is tested, and
bisection proceeds based on whether the binary produces correct output.

When the bisection completes, we are left with a minimal set of
Subzero-translated functions that cause the failure.  Usually it is a single
function, though sometimes it might require a combination of several functions
to cause a failure; this may be due to an incorrect call ABI, for example.
However, Murphy's Law implies that the single failing function is enormous and
impractical to debug.  In that case, we can restart the bisection, explicitly
ignoring the enormous function, and try to find another candidate to debug.
(Future work is to automate this to find all minimal sets of functions, so that
debugging can focus on the simplest example.)
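
The search itself is ordinary bisection over the list of functions, given an
oracle that rebuilds the binary with a chosen subset of Subzero-translated
functions and runs the test.  The following sketch assumes a single culprit and
a hypothetical ``passesWithSubzeroSet`` oracle; multi-function failures need a
delta-debugging style generalization.

.. code-block:: c++

  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical oracle: link a binary in which exactly the named functions
  // take the Subzero-translated version (everything else takes the pnacl-llc
  // version), run the application, and return true if its output is correct.
  // In practice this step shells out to the build and test scripts.
  bool passesWithSubzeroSet(const std::vector<std::string> &SubzeroSet);

  // Repeatedly keep the half of the candidate list that still reproduces the
  // failure, until a single function remains.
  std::string bisect(std::vector<std::string> Candidates) {
    while (Candidates.size() > 1) {
      size_t Mid = Candidates.size() / 2;
      std::vector<std::string> FirstHalf(Candidates.begin(),
                                         Candidates.begin() + Mid);
      if (!passesWithSubzeroSet(FirstHalf))
        Candidates = FirstHalf; // The culprit is in the first half.
      else
        Candidates.assign(Candidates.begin() + Mid, Candidates.end());
    }
    return Candidates.front();
  }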

Fuzz testing
^^^^^^^^^^^^

As described above, we try to find internal Subzero bugs using fuzz testing
techniques.

Sanitizers
^^^^^^^^^^

Subzero can be built with `AddressSanitizer
<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
multithreading has been simple enough that TSan hasn't found any bugs, but ASan
has found at least one memory leak which was subsequently fixed.
`UndefinedBehaviorSanitizer
<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
(UBSan) support is in progress.  `Control flow integrity sanitization
<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
consideration.

Current status
==============

Target architectures
--------------------

Subzero is currently more or less complete for the x86-32 target.  It has been
refactored and extended to handle x86-64 as well, and that is mostly complete at
this point.

ARM32 work is in progress.  It has not yet reached the level of testing of x86,
at least in part because Subzero's register allocator needs modifications to
handle ARM's aliasing of floating point and vector registers.  Specifically, a
64-bit register is actually a gang of two consecutive and aligned 32-bit
registers, and a 128-bit register is a gang of four consecutive and aligned
32-bit registers.  ARM64 work has not started; when it does, it will be
native-only, since the Native Client sandbox model, validator, and other tools
were never defined for ARM64.
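
The aliasing structure the register allocator needs to model is roughly the
following sketch; the helper names are illustrative rather than Subzero's
actual data structures.

.. code-block:: c++

  #include <cstdio>

  // Illustrative model of ARM32 register ganging: each 64-bit D register
  // overlaps two consecutive, aligned 32-bit S registers (Dn = S2n, S2n+1),
  // and each 128-bit Q register overlaps four (Qn = S4n .. S4n+3).  Assigning
  // Q1 must therefore also reserve S4..S7 and D2..D3, which is why the
  // allocator needs explicit support for register aliases.
  int main() {
    bool SInUse[32] = {}; // S0..S31, the only registers tracked in this sketch.

    auto reserveQ = [&](int Q) {
      for (int I = 0; I < 4; ++I)
        SInUse[4 * Q + I] = true;
    };
    auto reserveD = [&](int D) {
      for (int I = 0; I < 2; ++I)
        SInUse[2 * D + I] = true;
    };

    reserveQ(1); // Q1 -> S4..S7
    reserveD(0); // D0 -> S0..S1
    for (int S = 0; S < 8; ++S)
      std::printf("S%d: %s\n", S, SInUse[S] ? "in use" : "free");
  }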

An external contributor is adding MIPS support, for the most part by following
the ARM work.

Translator performance
----------------------

Single-threaded translation speed is currently about 5× the ``pnacl-llc``
translation speed.  For a large ``.pexe`` file, the time breaks down as:

- 11% for parsing and initial IR building

- 4% for emitting to /dev/null

- 27% for liveness analysis (two liveness passes plus live range construction)

- 15% for linear-scan register allocation

- 9% for basic lowering

- 10% for advanced phi lowering

- ~11% for other minor analysis

- ~10% measurement overhead to acquire these numbers

Some improvements could undoubtedly be made, but it will be hard to increase the
speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality.  With
``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
``pnacl-llc`` translation speed, but code quality drops by a factor of 3.

Code quality
------------

Measured across 16 components of spec2k, Subzero's code quality is uniformly
better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
between ``pnacl-llc`` ``-O0`` and ``-O2``.

Translator size
---------------

When built in MINIMAL mode, the x86-64 native translator size for the x86-32
target is about 700 KB, not including the size of functions referenced in
dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
MB, and it is statically linked and also includes nop padding for bundling as
well as indirect branch masking.

Translator memory footprint
---------------------------

It's hard to draw firm conclusions about memory footprint, since the footprint
is at least proportional to the input function size, and there is no real limit
on the size of functions in the ``.pexe`` file.

That said, we looked at the memory footprint over time as Subzero translated
``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
disposal.  One of LLVM's libraries that Subzero uses can report the current
malloc heap usage.  With single-threaded translation, Subzero tends to hover
around 15 MB of memory usage.  There are a couple of monstrous functions where
Subzero grows to around 100 MB, but then it drops back down after those
functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
larger throughout translation, reaching several hundred MB by the time it
completes.

It's a bit more interesting when we enable multithreaded translation.  When
there are N translation threads, Subzero implements a policy that limits the
size of the translation queue to N entries -- if it is "full" when the parser
tries to add a new CFG, the parser blocks until one of the translation threads
removes a CFG.  This means the number of in-memory CFGs can (and generally does)
reach 2*N+1 (N waiting in the queue, N being translated, plus the one currently
being parsed), and so the memory footprint rises in proportion to the number of
threads.  Adding to the pressure is the observation that the monstrous functions
also take proportionally longer to translate, so there's a good chance many of
the monstrous functions will be active at the same time with multithreaded
translation.  As a result, for N=32, Subzero's memory footprint peaks at about
260 MB, but drops back down as the large functions finish translating.
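
The queueing policy described above amounts to a bounded producer/consumer
queue.  A minimal sketch follows; the class and type names are illustrative,
and Subzero's real queue also deals with shutdown and with keeping the emitted
output in order.

.. code-block:: c++

  #include <condition_variable>
  #include <cstddef>
  #include <deque>
  #include <memory>
  #include <mutex>

  struct Cfg {}; // Stand-in for Subzero's per-function CFG.

  class TranslationQueue {
  public:
    explicit TranslationQueue(size_t MaxEntries) : MaxEntries(MaxEntries) {}

    // Called by the parser (producer); blocks while the queue is full.
    void push(std::unique_ptr<Cfg> Func) {
      std::unique_lock<std::mutex> Lock(Mutex);
      NotFull.wait(Lock, [&] { return Queue.size() < MaxEntries; });
      Queue.push_back(std::move(Func));
      NotEmpty.notify_one();
    }

    // Called by a translation thread (consumer); blocks while the queue is
    // empty.
    std::unique_ptr<Cfg> pop() {
      std::unique_lock<std::mutex> Lock(Mutex);
      NotEmpty.wait(Lock, [&] { return !Queue.empty(); });
      std::unique_ptr<Cfg> Func = std::move(Queue.front());
      Queue.pop_front();
      NotFull.notify_one();
      return Func;
    }

  private:
    const size_t MaxEntries;
    std::mutex Mutex;
    std::condition_variable NotFull, NotEmpty;
    std::deque<std::unique_ptr<Cfg>> Queue;
  };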

If this peak memory size becomes a problem, it might be possible for the parser
to resequence the functions to try to spread out the larger functions, or to
throttle the translation queue to prevent too many in-flight large functions.
It may also be possible to throttle based on memory pressure signaling from
Chrome.

Translator scalability
----------------------

Currently scalability is "not very good".  Multiple translation threads lead to
faster translation, but not to the degree desired.  We haven't dug in to
investigate yet.

There are a few areas to investigate.  First, there may be contention on the
constant pool, which all threads access, and which requires locked access even
for reading.  This could be mitigated by keeping a CFG-local cache of the most
common constants.
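
Such a cache might look like the following sketch, where repeated lookups of
common constants are answered locally and only misses touch the locked global
pool.  The class names are illustrative, not Subzero's actual constant pool
interface.

.. code-block:: c++

  #include <cstdint>
  #include <mutex>
  #include <unordered_map>

  struct Constant {}; // Stand-in for a pooled, uniqued constant object.

  // Global pool shared by all translation threads; every lookup takes the
  // lock, even when the constant already exists.  (Pooled constants live for
  // the lifetime of the translator, so nothing is freed here.)
  class GlobalConstantPool {
  public:
    Constant *getOrAdd(int64_t Value) {
      std::lock_guard<std::mutex> Lock(Mutex);
      Constant *&Slot = Pool[Value];
      if (!Slot)
        Slot = new Constant();
      return Slot;
    }

  private:
    std::mutex Mutex;
    std::unordered_map<int64_t, Constant *> Pool;
  };

  // Per-CFG cache: the fast path involves no locking at all.
  class CfgConstantCache {
  public:
    explicit CfgConstantCache(GlobalConstantPool &Global) : Global(Global) {}

    Constant *get(int64_t Value) {
      auto It = Local.find(Value);
      if (It != Local.end())
        return It->second;                  // Hit: lock-free.
      Constant *C = Global.getOrAdd(Value); // Miss: locked global lookup.
      Local.emplace(Value, C);
      return C;
    }

  private:
    GlobalConstantPool &Global;
    std::unordered_map<int64_t, Constant *> Local;
  };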

Second, there may be contention on memory allocation.  While almost all CFG
objects are allocated from the CFG-local allocator, some passes use temporary
STL containers that use the default allocator, which may require global locking.
This could be mitigated by switching these to the CFG-local allocator.
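
One way to do that is a small STL-compatible allocator adapter over the
CFG-local arena, so that a pass's temporary containers draw from per-CFG memory
instead of the shared heap.  The sketch below uses a simplistic, hypothetical
arena; Subzero's actual CFG-local allocator is different.

.. code-block:: c++

  #include <cstddef>
  #include <vector>

  // Simplistic arena standing in for the CFG-local allocator: one heap chunk
  // per allocation, all freed in bulk when the arena (i.e. the CFG) goes away.
  class Arena {
  public:
    void *allocate(size_t Bytes) {
      Chunks.push_back(new char[Bytes]);
      return Chunks.back();
    }
    ~Arena() {
      for (char *C : Chunks)
        delete[] C;
    }

  private:
    std::vector<char *> Chunks;
  };

  // STL-compatible allocator adapter.  deallocate() is a no-op because the
  // arena reclaims everything at once, which matches the stack-like allocation
  // pattern of a pass that builds temporary containers and throws them away.
  template <typename T> class ArenaAllocator {
  public:
    using value_type = T;
    explicit ArenaAllocator(Arena &A) : A(&A) {}
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U> &Other) : A(Other.A) {}

    T *allocate(size_t N) {
      return static_cast<T *>(A->allocate(N * sizeof(T)));
    }
    void deallocate(T *, size_t) {} // Freed in bulk by the arena.

    Arena *A;
  };

  template <typename T, typename U>
  bool operator==(const ArenaAllocator<T> &X, const ArenaAllocator<U> &Y) {
    return X.A == Y.A;
  }
  template <typename T, typename U>
  bool operator!=(const ArenaAllocator<T> &X, const ArenaAllocator<U> &Y) {
    return !(X == Y);
  }

  // Usage: a pass-local worklist whose storage comes from the CFG's arena.
  //   Arena CfgArena;
  //   std::vector<int, ArenaAllocator<int>> Worklist{ArenaAllocator<int>(CfgArena)};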

Third, multithreading may make the default allocator strategy more expensive.
In a single-threaded environment, a pass will allocate its containers, run the
pass, and deallocate the containers.  This results in stack-like allocation
behavior and makes the heap free list easier to manage, with less heap
fragmentation.  But when multithreading is added, the allocations and
deallocations become much less stack-like, making allocation and deallocation
operations individually more expensive.  Again, this could be mitigated by
switching these to the CFG-local allocator, as in the sketch above.
