Instancing
==========

The attribute descriptor lets the attribute unit compute the address of an
attribute given the vertex and instance ID. Unfortunately, the way this works is
rather complicated when instancing is enabled.

To explain this, first we need to explain how compute and vertex threads are
dispatched. When a quad is dispatched, it receives a single, linear index.
However, we need to translate that index into a (vertex id, instance id) pair.
One option would be to do:

.. math::
   \text{vertex id} = \text{linear id} \% \text{num vertices}

   \text{instance id} = \text{linear id} / \text{num vertices}

but this involves a costly division and modulus by an arbitrary number.
Instead, we could pad ``num_vertices``. We dispatch
:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
which we have to discard.
The more we pad ``num_vertices``, the more "wasted"
threads we dispatch, but the division is potentially easier.

One straightforward choice is to pad ``num_vertices`` to the next power
of two, which means that the division and modulus are just simple bit shifts
and masking. But the actual algorithm is a bit more complicated. The thread
dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
to dividing by a power of two. As a result, ``padded_num_vertices`` can
be 1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
since we need less padding.

``padded_num_vertices`` is picked by the hardware. The driver just specifies
the actual number of vertices. Note that ``padded_num_vertices`` is a multiple
of four (presumably because threads are dispatched in groups of 4). Also,
``padded_num_vertices`` is always at least one more than ``num_vertices``,
which seems like a quirk of the hardware. For larger ``num_vertices``, the
hardware uses the following algorithm: using the binary representation of
``num_vertices``, we look at the most significant set bit as well as the
following 3 bits. Let :math:`n` be the number of bits after those 4 bits.
Then we set ``padded_num_vertices`` according to the following table:

========== =======================
high bits  ``padded_num_vertices``
========== =======================
1000       :math:`9 \cdot 2^n`
1001       :math:`5 \cdot 2^{n+1}`
101x       :math:`3 \cdot 2^{n+2}`
110x       :math:`7 \cdot 2^{n+1}`
111x       :math:`2^{n+4}`
========== =======================

For example, if :math:`\text{num_vertices} = 70` is passed to
:c:func:`glDraw()`, its binary representation is 1000110, so :math:`n = 3`
and the high bits are 1000, and therefore
:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.

The attribute unit works in terms of the original ``linear_id``. If
:math:`\text{num_instances} = 1`, then the ``linear_id`` and the vertex ID are
the same, and everything is simple. However, with instancing things get more
complicated. There are four possible modes, two of which we can group together:

1. Use the ``linear_id`` directly. Only used when there is no instancing.

2. Use the ``linear_id`` modulo a constant. This is used for per-vertex
   attributes with instancing enabled by making the constant equal
   ``padded_num_vertices``. Because the modulus is always ``padded_num_vertices``,
   this mode only supports a modulus that is a power of 2 times 1, 3, 5, 7,
   or 9. The shift field specifies the power of two, while the ``extra_flags``
   field specifies the odd number. If :math:`\text{shift} = n` and
   :math:`\text{extra_flags} = m`, then the modulus is
   :math:`(2m + 1) \cdot 2^n`. As an example, if
   :math:`\text{num_vertices} = 70`, then as computed above,
   :math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
   :math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
   must exactly follow the hardware algorithm used to get ``padded_num_vertices``
   in order to correctly implement per-vertex attributes.

3. Divide the ``linear_id`` by a constant. In order to correctly implement
   instance divisors, we have to divide ``linear_id`` by ``padded_num_vertices``
   times the user-specified divisor.
   So first we compute ``padded_num_vertices``,
   again following the exact same algorithm that the hardware uses, then multiply
   it by the GL-level divisor to get the hardware-level divisor. This case is
   further divided into two more cases. If the hardware-level divisor is a
   power of two, then we just need to shift. The shift amount is specified by
   the shift field, so that the hardware-level divisor is just
   :math:`2^{\text{shift}}`.

   If it isn't a power of two, then we have to divide by an arbitrary integer.
   For that, we use the well-known technique of multiplying by an approximation
   of the inverse. The driver must compute the magic multiplier and shift
   amount, and then the hardware does the multiplication and shift. The
   hardware and driver also use the "round-down" optimization as described in
   https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
   The hardware further assumes the multiplier is between :math:`2^{31}` and
   :math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
   to 0 by the driver -- presumably this simplifies the hardware multiplier a
   little.
   The hardware first multiplies ``linear_id`` by the multiplier and
   takes the high 32 bits, then applies the round-down correction if
   :math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.

There are some differences between ridiculousfish's algorithm and the Mali
hardware algorithm, which means that the reference code from ridiculousfish
doesn't always produce the right constants. Mali does not use the pre-shift
optimization, since that would make a hardware implementation slower (it
would have to always do the pre-shift, multiply, and post-shift operations).
It also forces the multiplier to be at least :math:`2^{31}`, which means
that the exponent is entirely fixed, so there is no trial-and-error.
Altogether, given the divisor :math:`d`, the algorithm the driver must follow is:

1. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`.
2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and
   :math:`e = 2^{\text{shift} + 32} \bmod d`.
3. If :math:`e \leq 2^{\text{shift}}`, then we need to use the round-down algorithm.
   Set :math:`\text{magic_divisor} = m - 1` and :math:`\text{extra_flags} = 1`.
4. Otherwise, set :math:`\text{magic_divisor} = m` and
   :math:`\text{extra_flags} = 0`.
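
The padding table and the four steps above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not driver code:
the function names are hypothetical, ``padded_vertex_count`` covers only counts
with at least four significant bits (the text does not say how smaller counts
are handled), and ``emulate_divide`` is a software check that applies the
round-down variant by incrementing the dividend, as in the ridiculousfish
write-up. It does not model the exact order of operations inside the hardware.

```python
def padded_vertex_count(num_vertices):
    """Pad per the high-bits table. Small counts are not covered here."""
    bits = num_vertices.bit_length()
    if bits < 4:
        raise ValueError("the table needs at least 4 significant bits")
    n = bits - 4                  # number of bits below the top four
    high = num_vertices >> n      # top four bits, 0b1000..0b1111
    if high == 0b1000:
        return 9 << n
    if high == 0b1001:
        return 5 << (n + 1)
    if high <= 0b1011:            # 101x
        return 3 << (n + 2)
    if high <= 0b1101:            # 110x
        return 7 << (n + 1)
    return 1 << (n + 4)           # 111x


def modulus_fields(padded):
    """Split (2m + 1) * 2**n into (shift, extra_flags) = (n, m)."""
    shift = (padded & -padded).bit_length() - 1   # trailing zero bits
    odd = padded >> shift
    return shift, (odd - 1) // 2


def magic_divisor(d):
    """Steps 1-4 for a divisor d that is not a power of two.

    Returns (magic, shift, extra_flags). The magic value is the full
    multiplier with bit 31 set; per the text, the driver writes that
    bit as 0 and the hardware assumes it.
    """
    if d < 1 or (d & (d - 1)) == 0:
        raise ValueError("d must be positive and not a power of two")
    shift = d.bit_length() - 1    # floor(log2(d))
    t = 1 << (shift + 32)
    m = -(-t // d)                # ceil(t / d)
    e = t % d
    if e <= (1 << shift):
        return m - 1, shift, 1    # round-down variant
    return m, shift, 0


def emulate_divide(linear_id, d):
    """Software check only (assumption): add 1 to the dividend when rounding down."""
    magic, shift, extra_flags = magic_divisor(d)
    return ((linear_id + extra_flags) * magic) >> (32 + shift)
```

For the running example, ``padded_vertex_count(70)`` gives 72 and
``modulus_fields(72)`` gives ``(3, 4)``, matching the ``shift`` and
``extra_flags`` values worked out for mode 2.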