Instancing
==========

The attribute descriptor lets the attribute unit compute the address of an
attribute given the vertex and instance ID. Unfortunately, the way this works is
rather complicated when instancing is enabled.

To explain this, first we need to explain how compute and vertex threads are
dispatched. When a quad is dispatched, it receives a single, linear index.
However, we need to translate that index into a (vertex id, instance id) pair.
One option would be to do:

.. math::
   \text{vertex id} = \text{linear id} \% \text{num vertices}

   \text{instance id} = \text{linear id} / \text{num vertices}

but this involves a costly division and modulus by an arbitrary number.
Instead, we could pad ``num_vertices``. We dispatch
:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
which we have to discard. The more we pad ``num_vertices``, the more "wasted"
threads we dispatch, but the division is potentially easier.
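As a rough illustration of the padding idea, the sketch below enumerates the linear IDs of a padded dispatch and drops the extra threads. The function name ``live_threads`` and its parameters are made up for this example; it models the behaviour described above and is not driver or hardware code.

```python
def live_threads(num_vertices, padded_num_vertices, num_instances):
    """Return the (vertex_id, instance_id) pairs that survive the discard."""
    pairs = []
    for linear_id in range(padded_num_vertices * num_instances):
        # Split the linear index using the padded count, not the real one.
        vertex_id = linear_id % padded_num_vertices
        instance_id = linear_id // padded_num_vertices
        if vertex_id >= num_vertices:
            continue  # "extra" thread introduced by the padding
        pairs.append((vertex_id, instance_id))
    return pairs
```

For instance, with 3 vertices padded to 4 and 2 instances, 8 threads are dispatched and 2 of them are discarded, leaving 6 live (vertex, instance) pairs.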

One straightforward choice is to pad ``num_vertices`` to the next power
of two, which means that the division and modulus are just simple bit shifts
and masking. But the actual algorithm is a bit more complicated. The thread
dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
to dividing by a power of two. As a result, ``padded_num_vertices`` can
be 1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
since we need less padding.

``padded_num_vertices`` is picked by the hardware. The driver just specifies
the actual number of vertices. Note that ``padded_num_vertices`` is a multiple
of four (presumably because threads are dispatched in groups of 4). Also,
``padded_num_vertices`` is always at least one more than ``num_vertices``,
which seems like a quirk of the hardware. For larger ``num_vertices``, the
hardware uses the following algorithm: using the binary representation of
``num_vertices``, we look at the most significant set bit as well as the
following 3 bits. Let :math:`n` be the number of bits after those 4 bits. Then
we set ``padded_num_vertices`` according to the following table:

==========  =======================
high bits   ``padded_num_vertices``
==========  =======================
1000        :math:`9 \cdot 2^n`
1001        :math:`5 \cdot 2^{n+1}`
101x        :math:`3 \cdot 2^{n+2}`
110x        :math:`7 \cdot 2^{n+1}`
111x        :math:`2^{n+4}`
==========  =======================

For example, if :math:`\text{num_vertices} = 70` is passed to
:c:func:`glDraw()`, its binary representation is 1000110, so :math:`n = 3`
and the high bits are 1000, and therefore
:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.
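The table can be transcribed almost literally. The following is a sketch written from the table above, not taken from the driver: the function name is hypothetical, it only covers counts with at least four significant bits, and the behaviour for smaller counts is not modelled.

```python
def padded_num_vertices_large(num_vertices):
    """Apply the high-bits table to a count with at least 4 significant bits."""
    n = num_vertices.bit_length() - 4  # bits below the top four
    if n < 0:
        raise ValueError("sketch only covers counts >= 8")
    high = num_vertices >> n  # most significant set bit plus the next 3 bits
    if high == 0b1000:
        return 9 << n
    if high == 0b1001:
        return 5 << (n + 1)
    if high in (0b1010, 0b1011):  # 101x
        return 3 << (n + 2)
    if high in (0b1100, 0b1101):  # 110x
        return 7 << (n + 1)
    return 1 << (n + 4)  # 111x
```

Calling ``padded_num_vertices_large(70)`` reproduces the worked example and yields 72. In every row the result is strictly greater than the input, which matches the observation that the padded count is always at least one more than ``num_vertices``.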

The attribute unit works in terms of the original ``linear_id``. If
:math:`\text{num_instances} = 1`, then ``linear_id`` and the vertex ID are the
same, and everything is simple. However, with instancing things get more
complicated. There are four possible modes, two of which we can group together:

1. Use the ``linear_id`` directly. Only used when there is no instancing.

2. Use the ``linear_id`` modulo a constant. This is used for per-vertex
   attributes with instancing enabled by making the constant equal to
   ``padded_num_vertices``. Because the modulus is always
   ``padded_num_vertices``, this mode only supports a modulus that is a power
   of 2 times 1, 3, 5, 7, or 9. The shift field specifies the power of two,
   while the ``extra_flags`` field specifies the odd number. If
   :math:`\text{shift} = n` and :math:`\text{extra_flags} = m`, then the
   modulus is :math:`(2m + 1) \cdot 2^n`. As an example, if
   :math:`\text{num_vertices} = 70`, then as computed above,
   :math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
   :math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
   must exactly follow the hardware algorithm used to get
   ``padded_num_vertices`` in order to correctly implement per-vertex
   attributes.
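The encoding of the modulus can be sketched as a small helper that splits ``padded_num_vertices`` into its power-of-two and odd parts. The helper name is hypothetical and the code is only an illustration of the :math:`(2m + 1) \cdot 2^n` relation stated above.

```python
def modulus_fields(padded_num_vertices):
    """Split a padded count into (shift, extra_flags) for the modulus mode."""
    if padded_num_vertices <= 0:
        raise ValueError("padded count must be positive")
    # Isolate the lowest set bit to count the trailing zero bits.
    lowest = padded_num_vertices & -padded_num_vertices
    shift = lowest.bit_length() - 1
    odd = padded_num_vertices >> shift
    if odd not in (1, 3, 5, 7, 9):
        raise ValueError("not 1, 3, 5, 7, or 9 times a power of two")
    extra_flags = (odd - 1) // 2  # inverse of odd = 2 * extra_flags + 1
    return shift, extra_flags
```

For the running example, ``modulus_fields(72)`` returns ``(3, 4)``, matching :math:`\text{shift} = 3` and :math:`\text{extra_flags} = 4`.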

3. Divide the ``linear_id`` by a constant. In order to correctly implement
   instance divisors, we have to divide ``linear_id`` by
   ``padded_num_vertices`` times the user-specified divisor. So first we
   compute ``padded_num_vertices``, again following the exact same algorithm
   that the hardware uses, then multiply it by the GL-level divisor to get the
   hardware-level divisor. This case is further divided into two more cases.
   If the hardware-level divisor is a power of two, then we just need to
   shift. The shift amount is specified by the shift field, so that the
   hardware-level divisor is just :math:`2^\text{shift}`.

If it isn't a power of two, then we have to divide by an arbitrary integer.
For that, we use the well-known technique of multiplying by an approximation
of the inverse. The driver must compute the magic multiplier and shift
amount, and then the hardware does the multiplication and shift. The
hardware and driver also use the "round-down" optimization as described in
https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
The hardware further assumes the multiplier is between :math:`2^{31}` and
:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
to 0 by the driver -- presumably this simplifies the hardware multiplier a
little. The hardware first multiplies ``linear_id`` by the multiplier and
takes the high 32 bits, then applies the round-down correction if
:math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.
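A software model of this division might look like the sketch below. It is not a description of the actual hardware datapath: the function name is hypothetical, and it assumes the round-down correction is equivalent to incrementing the numerator before the multiply, which is how the referenced paper formulates it.

```python
def hw_divide(linear_id, magic_divisor, shift, extra_flags):
    """Model the non-power-of-two divide for a 32-bit linear_id."""
    # The driver stores the multiplier with its top bit cleared;
    # the hardware is described as treating that bit as implicitly set.
    multiplier = magic_divisor | (1 << 31)
    # Assumption: round-down correction == multiply (linear_id + 1) instead.
    numerator = linear_id + 1 if extra_flags else linear_id
    # Taking the high 32 bits and then shifting right by `shift`
    # is the same as one shift by 32 + shift.
    return (numerator * multiplier) >> (32 + shift)
```

As a hand-worked case, dividing by 3 uses a stored multiplier of ``0x2AAAAAAA`` (``0xAAAAAAAA`` with the top bit cleared), a shift of 1, and ``extra_flags`` of 1, so ``hw_divide(7, 0x2AAAAAAA, 1, 1)`` gives 2.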

There are some differences between ridiculousfish's algorithm and the Mali
hardware algorithm, which means that the reference code from ridiculousfish
doesn't always produce the right constants. Mali does not use the pre-shift
optimization, since that would make a hardware implementation slower (it
would have to always do the pre-shift, multiply, and post-shift operations).
It also forces the multiplier to be at least :math:`2^{31}`, which means
that the exponent is entirely fixed, so there is no trial-and-error.
Altogether, given the divisor :math:`d`, the algorithm the driver must follow
is:

1. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`.
2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and
   :math:`e = 2^{\text{shift} + 32} \% d`.
3. If :math:`e \leq 2^{\text{shift}}`, then we need to use the round-down
   algorithm. Set :math:`\text{magic_divisor} = m - 1` and
   :math:`\text{extra_flags} = 1`.
4. Otherwise, set :math:`\text{magic_divisor} = m` and
   :math:`\text{extra_flags} = 0`.
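The four steps above can be sketched with exact integer arithmetic as follows. The function name and return order are hypothetical, and the returned multiplier is the full 32-bit value of :math:`m` or :math:`m - 1`; clearing its top bit before writing it to the descriptor is left out of this sketch.

```python
def compute_magic_divisor(d):
    """Return (magic_divisor, shift, extra_flags) for a non-power-of-two d."""
    if d < 3 or (d & (d - 1)) == 0:
        raise ValueError("expected a divisor that is not a power of two")
    shift = d.bit_length() - 1      # step 1: floor(log2(d))
    t = 1 << (shift + 32)
    m = -(-t // d)                  # step 2: ceil(2^(shift + 32) / d)
    e = t % d                       # step 2: 2^(shift + 32) mod d
    if e <= (1 << shift):           # step 3: round-down case
        return m - 1, shift, 1
    return m, shift, 0              # step 4: plain round-up multiplier
```

For :math:`d = 3` this gives a multiplier of ``0xAAAAAAAA`` with a shift of 1 and ``extra_flags`` of 1, while :math:`d = 11` takes the other branch and gets ``extra_flags`` of 0. In both branches the multiplier lies between :math:`2^{31}` and :math:`2^{32}`, which is consistent with the implicit high bit described earlier.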