xref: /aosp_15_r20/external/llvm/docs/BigEndianNEON.rst (revision 9880d6810fe72a1726cb53787c6711e909410d58)
1*9880d681SAndroid Build Coastguard Worker==============================================
2*9880d681SAndroid Build Coastguard WorkerUsing ARM NEON instructions in big endian mode
3*9880d681SAndroid Build Coastguard Worker==============================================
4*9880d681SAndroid Build Coastguard Worker
5*9880d681SAndroid Build Coastguard Worker.. contents::
6*9880d681SAndroid Build Coastguard Worker    :local:
7*9880d681SAndroid Build Coastguard Worker
8*9880d681SAndroid Build Coastguard WorkerIntroduction
9*9880d681SAndroid Build Coastguard Worker============
10*9880d681SAndroid Build Coastguard Worker
11*9880d681SAndroid Build Coastguard WorkerGenerating code for big endian ARM processors is for the most part straightforward. NEON loads and stores however have some interesting properties that make code generation decisions less obvious in big endian mode.
12*9880d681SAndroid Build Coastguard Worker
13*9880d681SAndroid Build Coastguard WorkerThe aim of this document is to explain the problem with NEON loads and stores, and the solution that has been implemented in LLVM.
14*9880d681SAndroid Build Coastguard Worker
15*9880d681SAndroid Build Coastguard WorkerIn this document the term "vector" refers to what the ARM ABI calls a "short vector", which is a sequence of items that can fit in a NEON register. This sequence can be 64 or 128 bits in length, and can constitute 8, 16, 32 or 64 bit items. This document refers to A64 instructions throughout, but is almost applicable to the A32/ARMv7 instruction sets also. The ABI format for passing vectors in A32 is sligtly different to A64. Apart from that, the same concepts apply.
16*9880d681SAndroid Build Coastguard Worker
17*9880d681SAndroid Build Coastguard WorkerExample: C-level intrinsics -> assembly
18*9880d681SAndroid Build Coastguard Worker---------------------------------------
19*9880d681SAndroid Build Coastguard Worker
20*9880d681SAndroid Build Coastguard WorkerIt may be helpful first to illustrate how C-level ARM NEON intrinsics are lowered to instructions.
21*9880d681SAndroid Build Coastguard Worker
22*9880d681SAndroid Build Coastguard WorkerThis trivial C function takes a vector of four ints and sets the zero'th lane to the value "42"::
23*9880d681SAndroid Build Coastguard Worker
24*9880d681SAndroid Build Coastguard Worker    #include <arm_neon.h>
25*9880d681SAndroid Build Coastguard Worker    int32x4_t f(int32x4_t p) {
26*9880d681SAndroid Build Coastguard Worker        return vsetq_lane_s32(42, p, 0);
27*9880d681SAndroid Build Coastguard Worker    }
28*9880d681SAndroid Build Coastguard Worker
29*9880d681SAndroid Build Coastguard Workerarm_neon.h intrinsics generate "generic" IR where possible (that is, normal IR instructions not ``llvm.arm.neon.*`` intrinsic calls). The above generates::
30*9880d681SAndroid Build Coastguard Worker
31*9880d681SAndroid Build Coastguard Worker    define <4 x i32> @f(<4 x i32> %p) {
32*9880d681SAndroid Build Coastguard Worker      %vset_lane = insertelement <4 x i32> %p, i32 42, i32 0
33*9880d681SAndroid Build Coastguard Worker      ret <4 x i32> %vset_lane
34*9880d681SAndroid Build Coastguard Worker    }
35*9880d681SAndroid Build Coastguard Worker
36*9880d681SAndroid Build Coastguard WorkerWhich then becomes the following trivial assembly::
37*9880d681SAndroid Build Coastguard Worker
38*9880d681SAndroid Build Coastguard Worker    f:                                      // @f
39*9880d681SAndroid Build Coastguard Worker            movz	w8, #0x2a
40*9880d681SAndroid Build Coastguard Worker            ins 	v0.s[0], w8
41*9880d681SAndroid Build Coastguard Worker            ret
42*9880d681SAndroid Build Coastguard Worker
43*9880d681SAndroid Build Coastguard WorkerProblem
44*9880d681SAndroid Build Coastguard Worker=======
45*9880d681SAndroid Build Coastguard Worker
46*9880d681SAndroid Build Coastguard WorkerThe main problem is how vectors are represented in memory and in registers.
47*9880d681SAndroid Build Coastguard Worker
48*9880d681SAndroid Build Coastguard WorkerFirst, a recap. The "endianness" of an item affects its representation in memory only. In a register, a number is just a sequence of bits - 64 bits in the case of AArch64 general purpose registers. Memory, however, is a sequence of addressable units of 8 bits in size. Any number greater than 8 bits must therefore be split up into 8-bit chunks, and endianness describes the order in which these chunks are laid out in memory.
49*9880d681SAndroid Build Coastguard Worker
50*9880d681SAndroid Build Coastguard WorkerA "little endian" layout has the least significant byte first (lowest in memory address). A "big endian" layout has the *most* significant byte first. This means that when loading an item from big endian memory, the lowest 8-bits in memory must go in the most significant 8-bits, and so forth.
51*9880d681SAndroid Build Coastguard Worker
52*9880d681SAndroid Build Coastguard Worker``LDR`` and ``LD1``
53*9880d681SAndroid Build Coastguard Worker===================
54*9880d681SAndroid Build Coastguard Worker
55*9880d681SAndroid Build Coastguard Worker.. figure:: ARM-BE-ldr.png
56*9880d681SAndroid Build Coastguard Worker    :align: right
57*9880d681SAndroid Build Coastguard Worker
58*9880d681SAndroid Build Coastguard Worker    Big endian vector load using ``LDR``.
59*9880d681SAndroid Build Coastguard Worker
60*9880d681SAndroid Build Coastguard Worker
61*9880d681SAndroid Build Coastguard WorkerA vector is a consecutive sequence of items that are operated on simultaneously. To load a 64-bit vector, 64 bits need to be read from memory. In little endian mode, we can do this by just performing a 64-bit load - ``LDR q0, [foo]``. However if we try this in big endian mode, because of the byte swapping the lane indices end up being swapped! The zero'th item as laid out in memory becomes the n'th lane in the vector.
62*9880d681SAndroid Build Coastguard Worker
63*9880d681SAndroid Build Coastguard Worker.. figure:: ARM-BE-ld1.png
64*9880d681SAndroid Build Coastguard Worker    :align: right
65*9880d681SAndroid Build Coastguard Worker
66*9880d681SAndroid Build Coastguard Worker    Big endian vector load using ``LD1``. Note that the lanes retain the correct ordering.
67*9880d681SAndroid Build Coastguard Worker
68*9880d681SAndroid Build Coastguard Worker
69*9880d681SAndroid Build Coastguard WorkerBecause of this, the instruction ``LD1`` performs a vector load but performs byte swapping not on the entire 64 bits, but on the individual items within the vector. This means that the register content is the same as it would have been on a little endian system.
70*9880d681SAndroid Build Coastguard Worker
71*9880d681SAndroid Build Coastguard WorkerIt may seem that ``LD1`` should suffice to peform vector loads on a big endian machine. However there are pros and cons to the two approaches that make it less than simple which register format to pick.
72*9880d681SAndroid Build Coastguard Worker
73*9880d681SAndroid Build Coastguard WorkerThere are two options:
74*9880d681SAndroid Build Coastguard Worker
75*9880d681SAndroid Build Coastguard Worker    1. The content of a vector register is the same *as if* it had been loaded with an ``LDR`` instruction.
76*9880d681SAndroid Build Coastguard Worker    2. The content of a vector register is the same *as if* it had been loaded with an ``LD1`` instruction.
77*9880d681SAndroid Build Coastguard Worker
78*9880d681SAndroid Build Coastguard WorkerBecause ``LD1 == LDR + REV`` and similarly ``LDR == LD1 + REV`` (on a big endian system), we can simulate either type of load with the other type of load plus a ``REV`` instruction. So we're not deciding which instructions to use, but which format to use (which will then influence which instruction is best to use).
79*9880d681SAndroid Build Coastguard Worker
80*9880d681SAndroid Build Coastguard Worker.. The 'clearer' container is required to make the following section header come after the floated
81*9880d681SAndroid Build Coastguard Worker   images above.
82*9880d681SAndroid Build Coastguard Worker.. container:: clearer
83*9880d681SAndroid Build Coastguard Worker
84*9880d681SAndroid Build Coastguard Worker    Note that throughout this section we only mention loads. Stores have exactly the same problems as their associated loads, so have been skipped for brevity.
85*9880d681SAndroid Build Coastguard Worker
86*9880d681SAndroid Build Coastguard Worker
87*9880d681SAndroid Build Coastguard WorkerConsiderations
88*9880d681SAndroid Build Coastguard Worker==============
89*9880d681SAndroid Build Coastguard Worker
90*9880d681SAndroid Build Coastguard WorkerLLVM IR Lane ordering
91*9880d681SAndroid Build Coastguard Worker---------------------
92*9880d681SAndroid Build Coastguard Worker
93*9880d681SAndroid Build Coastguard WorkerLLVM IR has first class vector types. In LLVM IR, the zero'th element of a vector resides at the lowest memory address. The optimizer relies on this property in certain areas, for example when concatenating vectors together. The intention is for arrays and vectors to have identical memory layouts - ``[4 x i8]`` and ``<4 x i8>`` should be represented the same in memory. Without this property there would be many special cases that the optimizer would have to cleverly handle.
94*9880d681SAndroid Build Coastguard Worker
95*9880d681SAndroid Build Coastguard WorkerUse of ``LDR`` would break this lane ordering property. This doesn't preclude the use of ``LDR``, but we would have to do one of two things:
96*9880d681SAndroid Build Coastguard Worker
97*9880d681SAndroid Build Coastguard Worker   1. Insert a ``REV`` instruction to reverse the lane order after every ``LDR``.
98*9880d681SAndroid Build Coastguard Worker   2. Disable all optimizations that rely on lane layout, and for every access to an individual lane (``insertelement``/``extractelement``/``shufflevector``) reverse the lane index.
99*9880d681SAndroid Build Coastguard Worker
100*9880d681SAndroid Build Coastguard WorkerAAPCS
101*9880d681SAndroid Build Coastguard Worker-----
102*9880d681SAndroid Build Coastguard Worker
103*9880d681SAndroid Build Coastguard WorkerThe ARM procedure call standard (AAPCS) defines the ABI for passing vectors between functions in registers. It states:
104*9880d681SAndroid Build Coastguard Worker
105*9880d681SAndroid Build Coastguard Worker    When a short vector is transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single ``STR`` of the entire register; a short vector is loaded from memory using the corresponding ``LDR`` instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector.
106*9880d681SAndroid Build Coastguard Worker
107*9880d681SAndroid Build Coastguard Worker    -- Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 4.1.2 Short Vectors
108*9880d681SAndroid Build Coastguard Worker
109*9880d681SAndroid Build Coastguard WorkerThe use of ``LDR`` and ``STR`` as the ABI defines has at least one advantage over ``LD1`` and ``ST1``. ``LDR`` and ``STR`` are oblivious to the size of the individual lanes of a vector. ``LD1`` and ``ST1`` are not - the lane size is encoded within them. This is important across an ABI boundary, because it would become necessary to know the lane width the callee expects. Consider the following code:
110*9880d681SAndroid Build Coastguard Worker
111*9880d681SAndroid Build Coastguard Worker.. code-block:: c
112*9880d681SAndroid Build Coastguard Worker
113*9880d681SAndroid Build Coastguard Worker    <callee.c>
114*9880d681SAndroid Build Coastguard Worker    void callee(uint32x2_t v) {
115*9880d681SAndroid Build Coastguard Worker      ...
116*9880d681SAndroid Build Coastguard Worker    }
117*9880d681SAndroid Build Coastguard Worker
118*9880d681SAndroid Build Coastguard Worker    <caller.c>
119*9880d681SAndroid Build Coastguard Worker    extern void callee(uint32x2_t);
120*9880d681SAndroid Build Coastguard Worker    void caller() {
121*9880d681SAndroid Build Coastguard Worker      callee(...);
122*9880d681SAndroid Build Coastguard Worker    }
123*9880d681SAndroid Build Coastguard Worker
124*9880d681SAndroid Build Coastguard WorkerIf ``callee`` changed its signature to ``uint16x4_t``, which is equivalent in register content, if we passed as ``LD1`` we'd break this code until ``caller`` was updated and recompiled.
125*9880d681SAndroid Build Coastguard Worker
126*9880d681SAndroid Build Coastguard WorkerThere is an argument that if the signatures of the two functions are different then the behaviour should be undefined. But there may be functions that are agnostic to the lane layout of the vector, and treating the vector as an opaque value (just loading it and storing it) would be impossible without a common format across ABI boundaries.
127*9880d681SAndroid Build Coastguard Worker
128*9880d681SAndroid Build Coastguard WorkerSo to preserve ABI compatibility, we need to use the ``LDR`` lane layout across function calls.
129*9880d681SAndroid Build Coastguard Worker
130*9880d681SAndroid Build Coastguard WorkerAlignment
131*9880d681SAndroid Build Coastguard Worker---------
132*9880d681SAndroid Build Coastguard Worker
133*9880d681SAndroid Build Coastguard WorkerIn strict alignment mode, ``LDR qX`` requires its address to be 128-bit aligned, whereas ``LD1`` only requires it to be as aligned as the lane size. If we canonicalised on using ``LDR``, we'd still need to use ``LD1`` in some places to avoid alignment faults (the result of the ``LD1`` would then need to be reversed with ``REV``).
134*9880d681SAndroid Build Coastguard Worker
135*9880d681SAndroid Build Coastguard WorkerMost operating systems however do not run with alignment faults enabled, so this is often not an issue.
136*9880d681SAndroid Build Coastguard Worker
137*9880d681SAndroid Build Coastguard WorkerSummary
138*9880d681SAndroid Build Coastguard Worker-------
139*9880d681SAndroid Build Coastguard Worker
140*9880d681SAndroid Build Coastguard WorkerThe following table summarises the instructions that are required to be emitted for each property mentioned above for each of the two solutions.
141*9880d681SAndroid Build Coastguard Worker
142*9880d681SAndroid Build Coastguard Worker+-------------------------------+-------------------------------+---------------------+
143*9880d681SAndroid Build Coastguard Worker|                               | ``LDR`` layout                | ``LD1`` layout      |
144*9880d681SAndroid Build Coastguard Worker+===============================+===============================+=====================+
145*9880d681SAndroid Build Coastguard Worker| Lane ordering                 |   ``LDR + REV``               |    ``LD1``          |
146*9880d681SAndroid Build Coastguard Worker+-------------------------------+-------------------------------+---------------------+
147*9880d681SAndroid Build Coastguard Worker| AAPCS                         |   ``LDR``                     |    ``LD1 + REV``    |
148*9880d681SAndroid Build Coastguard Worker+-------------------------------+-------------------------------+---------------------+
149*9880d681SAndroid Build Coastguard Worker| Alignment for strict mode     |   ``LDR`` / ``LD1 + REV``     |    ``LD1``          |
150*9880d681SAndroid Build Coastguard Worker+-------------------------------+-------------------------------+---------------------+
151*9880d681SAndroid Build Coastguard Worker
152*9880d681SAndroid Build Coastguard WorkerNeither approach is perfect, and choosing one boils down to choosing the lesser of two evils. The issue with lane ordering, it was decided, would have to change target-agnostic compiler passes and would result in a strange IR in which lane indices were reversed. It was decided that this was worse than the changes that would have to be made to support ``LD1``, so ``LD1`` was chosen as the canonical vector load instruction (and by inference, ``ST1`` for vector stores).
153*9880d681SAndroid Build Coastguard Worker
154*9880d681SAndroid Build Coastguard WorkerImplementation
155*9880d681SAndroid Build Coastguard Worker==============
156*9880d681SAndroid Build Coastguard Worker
157*9880d681SAndroid Build Coastguard WorkerThere are 3 parts to the implementation:
158*9880d681SAndroid Build Coastguard Worker
159*9880d681SAndroid Build Coastguard Worker    1. Predicate ``LDR`` and ``STR`` instructions so that they are never allowed to be selected to generate vector loads and stores. The exception is one-lane vectors [1]_ - these by definition cannot have lane ordering problems so are fine to use ``LDR``/``STR``.
160*9880d681SAndroid Build Coastguard Worker
161*9880d681SAndroid Build Coastguard Worker    2. Create code generation patterns for bitconverts that create ``REV`` instructions.
162*9880d681SAndroid Build Coastguard Worker
163*9880d681SAndroid Build Coastguard Worker    3. Make sure appropriate bitconverts are created so that vector values get passed over call boundaries as 1-element vectors (which is the same as if they were loaded with ``LDR``).
164*9880d681SAndroid Build Coastguard Worker
165*9880d681SAndroid Build Coastguard WorkerBitconverts
166*9880d681SAndroid Build Coastguard Worker-----------
167*9880d681SAndroid Build Coastguard Worker
168*9880d681SAndroid Build Coastguard Worker.. image:: ARM-BE-bitcastfail.png
169*9880d681SAndroid Build Coastguard Worker    :align: right
170*9880d681SAndroid Build Coastguard Worker
171*9880d681SAndroid Build Coastguard WorkerThe main problem with the ``LD1`` solution is dealing with bitconverts (or bitcasts, or reinterpret casts). These are pseudo instructions that only change the compiler's interpretation of data, not the underlying data itself. A requirement is that if data is loaded and then saved again (called a "round trip"), the memory contents should be the same after the store as before the load. If a vector is loaded and is then bitconverted to a different vector type before storing, the round trip will currently be broken.
172*9880d681SAndroid Build Coastguard Worker
173*9880d681SAndroid Build Coastguard WorkerTake for example this code sequence::
174*9880d681SAndroid Build Coastguard Worker
175*9880d681SAndroid Build Coastguard Worker    %0 = load <4 x i32> %x
176*9880d681SAndroid Build Coastguard Worker    %1 = bitcast <4 x i32> %0 to <2 x i64>
177*9880d681SAndroid Build Coastguard Worker         store <2 x i64> %1, <2 x i64>* %y
178*9880d681SAndroid Build Coastguard Worker
179*9880d681SAndroid Build Coastguard WorkerThis would produce a code sequence such as that in the figure on the right. The mismatched ``LD1`` and ``ST1`` cause the stored data to differ from the loaded data.
180*9880d681SAndroid Build Coastguard Worker
181*9880d681SAndroid Build Coastguard Worker.. container:: clearer
182*9880d681SAndroid Build Coastguard Worker
183*9880d681SAndroid Build Coastguard Worker    When we see a bitcast from type ``X`` to type ``Y``, what we need to do is to change the in-register representation of the data to be *as if* it had just been loaded by a ``LD1`` of type ``Y``.
184*9880d681SAndroid Build Coastguard Worker
185*9880d681SAndroid Build Coastguard Worker.. image:: ARM-BE-bitcastsuccess.png
186*9880d681SAndroid Build Coastguard Worker    :align: right
187*9880d681SAndroid Build Coastguard Worker
188*9880d681SAndroid Build Coastguard WorkerConceptually this is simple - we can insert a ``REV`` undoing the ``LD1`` of type ``X`` (converting the in-register representation to the same as if it had been loaded by ``LDR``) and then insert another ``REV`` to change the representation to be as if it had been loaded by an ``LD1`` of type ``Y``.
189*9880d681SAndroid Build Coastguard Worker
190*9880d681SAndroid Build Coastguard WorkerFor the previous example, this would be::
191*9880d681SAndroid Build Coastguard Worker
192*9880d681SAndroid Build Coastguard Worker    LD1   v0.4s, [x]
193*9880d681SAndroid Build Coastguard Worker
194*9880d681SAndroid Build Coastguard Worker    REV64 v0.4s, v0.4s                  // There is no REV128 instruction, so it must be synthesizedcd
195*9880d681SAndroid Build Coastguard Worker    EXT   v0.16b, v0.16b, v0.16b, #8    // with a REV64 then an EXT to swap the two 64-bit elements.
196*9880d681SAndroid Build Coastguard Worker
197*9880d681SAndroid Build Coastguard Worker    REV64 v0.2d, v0.2d
198*9880d681SAndroid Build Coastguard Worker    EXT   v0.16b, v0.16b, v0.16b, #8
199*9880d681SAndroid Build Coastguard Worker
200*9880d681SAndroid Build Coastguard Worker    ST1   v0.2d, [y]
201*9880d681SAndroid Build Coastguard Worker
202*9880d681SAndroid Build Coastguard WorkerIt turns out that these ``REV`` pairs can, in almost all cases, be squashed together into a single ``REV``. For the example above, a ``REV128 4s`` + ``REV128 2d`` is actually a ``REV64 4s``, as shown in the figure on the right.
203*9880d681SAndroid Build Coastguard Worker
204*9880d681SAndroid Build Coastguard Worker.. [1] One lane vectors may seem useless as a concept but they serve to distinguish between values held in general purpose registers and values held in NEON/VFP registers. For example, an ``i64`` would live in an ``x`` register, but ``<1 x i64>`` would live in a ``d`` register.
205*9880d681SAndroid Build Coastguard Worker
206