=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two micro-controllers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (basically glorified
register writers) and nearly all of their state is held in a collection
of registers. I.e. there is no stack, and no memory assigned to
them; any global state, like which bank of context registers is to
be used in the next draw, is stored in a register.

The setup is similar to radeon; in fact Adreno 2xx through 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and an "ME" (Micro Engine, also confusingly referred
to as "PM4"). These make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remains the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

Starting with Adreno 660, another processor called LPAC (Low Priority
Asynchronous Compute) is introduced, which is a slightly cut-down copy of the
SQE used to execute background compute tasks. Unlike on 5xx, the firmware is
bundled together with the main SQE firmware, and the SQE is responsible for
booting LPAC. On 7xx, to implement concurrent binning, the SQE is split into two
processors called BR and BV. Again, the firmware for all three is bundled
together and BR is responsible for booting both BV and LPAC.

.. _afuc-overview:

Instruction Set Overview
========================

The afuc instruction set is heavily inspired by MIPS, but is not exactly
compatible.

Registers
=========

Similar to MIPS, there are 32 registers, and some are special purpose. ``$00``
is the same as ``$zero`` on MIPS: it reads as 0 and writes are discarded.

Registers are displayed in the current disassembly with a hexadecimal
numbering, e.g. ``$0a`` is encoded as 10.

The ABI used when processing packets is that ``$01`` contains the current PM4
header, registers from ``$02`` up to ``$11`` are temporaries and may be freely
clobbered by the packet handler, while ``$12`` and above are used to store
global state like the IB level and next visible draw (for draw skipping).

Unlike in MIPS, there is a special small hardware-managed stack and special
instructions ``call``/``ret`` which use it. The stack only contains return
addresses; there is no "stack frame" to spill values to. As a result, ``$sp``,
``$fp``, and ``$ra`` don't exist as on MIPS. Instead the last 3 registers are
used to :ref:`read <afuc-read>` from various queues and
:ref:`write GPU registers <afuc-reg-writes>`. In addition there is a ``$rem``
register which normally contains the number of words remaining in the packet
but can also be used as a normal register in combination with the ``(rep)``
prefix.

.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add`` - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub`` - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and`` - bitwise AND
- ``or`` - bitwise OR
- ``xor`` - bitwise XOR
- ``not`` - bitwise NOT (no src1)
- ``shl`` - shift-left
- ``ushr`` - unsigned shift-right
- ``ishr`` - signed shift-right
- ``rot`` - rotate-left (like shift-left with wrap-around)
- ``mul8`` - multiply low 8b of two src
- ``min`` - minimum
- ``max`` - maximum
- ``cmp`` - compare two values

Similar to MIPS, the ALU instructions can take either two src registers, or a
src plus 16b immediate as 2nd src, e.g.::

   add $dst, $src, 0x1234 ; src2 is immed
   add $dst, $src1, $src2 ; src2 is reg

The ``not`` instruction only takes a single source::

   not $dst, $src
   not $dst, 0x1234

One departure from MIPS is that there is a special immediate-form ``mov``
instruction that can shift the 16-bit immediate by a given amount::

   mov $dst, 0x1234 << 2

This replaces ``lui`` on MIPS (just use a shift of 16) while also allowing the
quick construction of small bitfields, which comes in handy in various places.
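
As a rough illustration of the value this form produces, the following Python
sketch models the shifted immediate. The helper name ``mov_imm`` and the
truncation to 32 bits are assumptions made for the example, not something
stated about the hardware:

.. code-block:: python

   MASK32 = 0xFFFFFFFF

   def mov_imm(imm16, shift):
       """Model ``mov $dst, imm16 << shift`` (hypothetical helper)."""
       if not 0 <= imm16 <= 0xFFFF:
           raise ValueError("immediate must fit in 16 bits")
       # Assumption: the result is truncated to the 32-bit register width.
       return (imm16 << shift) & MASK32

   upper_half = mov_imm(0x1234, 16)  # the job ``lui`` does on MIPS
   bitfield = mov_imm(0x3, 8)        # a 2-bit field placed at bit 8

With a shift of 16 the immediate lands in the upper half of the register, and
smaller shifts place a short bit pattern at an arbitrary position.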

.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See the explanation in :ref:`afuc-branch`.


.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

   breq $src, b3, #somelabel ; branch if src & (1 << 3)
   breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns. For example::

   cmp $04, $02, $03
   breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.
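
To see why a single bit test is enough, the Python sketch below models the
three ``cmp`` results and derives one possible bit test per relation. Only the
``b1`` test for "less than or equal" appears in the example above; the other
bit choices are worked out from the bit patterns and are not confirmed to be
what the firmware uses. The names ``BIT_TESTS`` and ``branch_taken`` are
hypothetical, and plain Python integer comparison stands in for whatever
signedness the hardware uses:

.. code-block:: python

   CMP_GT, CMP_EQ, CMP_LT = 0x00, 0x2B, 0x1E

   def cmp(src1, src2):
       """Return the bit pattern documented for ``cmp $dst, $src1, $src2``."""
       if src1 > src2:
           return CMP_GT
       if src1 == src2:
           return CMP_EQ
       return CMP_LT

   # relation -> (bit to test, branch when the bit is set?)
   # ``breq`` branches when the bit is set, ``brne`` when it is clear.
   BIT_TESTS = {
       "eq": (0, True),
       "le": (1, True),
       "lt": (2, True),
       "ne": (0, False),
       "gt": (1, False),
       "ge": (2, False),
   }

   def branch_taken(relation, src1, src2):
       bit, want_set = BIT_TESTS[relation]
       is_set = bool(cmp(src1, src2) & (1 << bit))
       return is_set == want_set

For instance, ``branch_taken("le", 2, 3)`` models the ``breq $04, b1`` example
above, while ``branch_taken("gt", 2, 3)`` models ``brne $04, b1``.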

Delay slots
-----------

Branch instructions have a delay slot, so the following instruction is always
executed regardless of whether the branch is taken or not. Unlike MIPS, a branch
in the delay slot is legal as long as the original branch and the branch in its
delay slot are never both taken. Because jump tables are awkward and slow due
to the lack of memory caching, this is often exploited to create dense
sequences of branches to implement switch-case constructs::

   breq $02, 0x1, #foo
   breq $02, 0x2, #bar
   breq $02, 0x3, #baz
   ...
   nop
   jump #default

Another common use of a branch in a delay slot is a double-jump (jump to one
location if a condition is true, and another location if false). In MIPS this
requires two delay slots::

   beq $t0, 0x1, #foo
   nop ; beq delay slot
   b #bar
   nop ; b delay slot

In afuc this only requires a delay slot for the second branch::

   breq $02, 0x1, #foo
   brne $02, 0x1, #bar
   nop

Note that for the second branch we had to use a conditional branch with the
opposite condition instead of an unconditional branch as in the MIPS example,
to guarantee that at most one is ever taken.
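
The switch-case pattern can be traced with a toy interpreter. This is only a
sketch of the delay-slot rule as described in this section: the tuple-based
program encoding, the ``halt`` pseudo-instruction, and the choice to raise an
error when both branches are taken are all assumptions for illustration (what
the hardware does in that case is not described here):

.. code-block:: python

   def run(program, r2):
       """Return the index of the ``halt`` reached for a given value of $02."""
       pc = 0
       pending = None  # target of a taken branch whose delay slot is executing
       while True:
           op = program[pc]
           if op[0] == "halt":
               return pc
           taken = None
           if op[0] == "breq" and r2 == op[1]:
               taken = op[2]
           elif op[0] == "jump":
               taken = op[1]
           if pending is not None:
               if taken is not None:
                   raise RuntimeError("branch and delay-slot branch both taken")
               pc, pending = pending, None
           elif taken is not None:
               pc, pending = pc + 1, taken  # execute the delay slot first
           else:
               pc += 1

   FOO, BAR, BAZ, DEFAULT = 6, 7, 8, 9
   SWITCH = [
       ("breq", 0x1, FOO),
       ("breq", 0x2, BAR),   # also the delay slot of the first breq
       ("breq", 0x3, BAZ),   # also the delay slot of the second breq
       ("nop",),             # delay slot of the third breq
       ("jump", DEFAULT),
       ("nop",),             # delay slot of the jump
       ("halt",), ("halt",), ("halt",), ("halt",),  # foo, bar, baz, default
   ]

Because ``$02`` can match at most one of the immediates, at most one ``breq`` in
the chain is ever taken, so each ``breq`` can safely sit in the delay slot of
the previous one. The ``nop`` before ``jump #default`` is needed because the
unconditional jump would otherwise be taken in the delay slot of a taken
``breq``.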

.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset from the SQE instruction base.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack. Definitely seems to be multiple
  levels of function call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.

.. _afuc-nop:

NOPs
====

Afuc has a special NOP encoding where the low 24 bits are ignored by the
processor. On a5xx the high 8 bits are ``00``, on a6xx they are ``01``
(probably to make sure that 0 is not a legal instruction, increasing the
chances of halting immediately when something is misconfigured). This is
sometimes used to create a "payload" that is ignored when executed. For example,
the first 2 instructions of the firmware typically contain the firmware ID and
version followed by the packet handling table offset encoded as NOPs. They are
skipped when executed but they are later read as data by the bootstrap routine.
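
A minimal sketch of how a tool might pull the payload out of such a word is
shown below. The ``decode_nop`` helper and the generation keys are hypothetical
names for this example; only the a5xx and a6xx opcode values come from the text
above:

.. code-block:: python

   # High 8 bits that mark a NOP, per generation (from the text above).
   NOP_OPCODE = {"a5xx": 0x00, "a6xx": 0x01}

   def decode_nop(word, gen):
       """Return the 24-bit payload if ``word`` is a NOP for ``gen``, else None."""
       if (word >> 24) & 0xFF != NOP_OPCODE[gen]:
           return None
       return word & 0xFFFFFF

   payload = decode_nop(0x01123456, "a6xx")     # 0x123456
   not_a_nop = decode_nop(0x01123456, "a5xx")   # None: wrong opcode for a5xx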

.. _afuc-control:

Control Registers
=================

Control registers are a special register space that can only be read/written
directly by CP through ``cread``/``cwrite`` instructions:

- ``cread $dst, [$off + addr], flags``
- ``cread $dst, [$off + addr]!, flags``
- ``cwrite $src, [$off + addr], flags``
- ``cwrite $src, [$off + addr]!, flags``

Control registers ``0x000`` to ``0x0ff`` are private registers used to control
the CP, for example to indicate where to read from memory or (normal)
registers. ``0x100`` to ``0x17f`` are a private scratch space used by the
firmware however it wants, for example as an ad-hoc stack to spill registers
when calling a function or to store the scratch used in ``CP_SCRATCH_TO_*``
packets. Starting with the introduction of LPAC, ``0x200`` to ``0x27f`` are a
shared scratch space used to communicate between processors, and on a7xx they
can also be written on event completion to implement so-called "on-chip
timestamps".

In cases where no offset is needed, ``$00`` is frequently used as the offset.

The addressing mode with ``!`` is a pre-increment mode that writes the final
address ``$off + addr`` to ``$off``.

For example, the following sequence sets the ``CP_IBn_BASE_LO``/``HI``/``BUFSIZE``
control registers while handling ``CP_INDIRECT_BUFFER``::

   ; load CP_INDIRECT_BUFFER parameters from cmdstream:
   mov $02, $data ; low 32b of IB target address
   mov $03, $data ; high 32b of IB target
   mov $04, $data ; IB size in dwords

   ; sanity check # of dwords:
   breq $04, 0x0, #l23

   ; this seems something to do with figuring out whether
   ; we are going from RB->IB1 or IB1->IB2 (ie. so the
   ; below cwrite instructions update either
   ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
   and $05, $18, 0x0003
   shl $05, $05, 0x0002

   ; update CP_IBn_BASE_LO/HI/BUFSIZE:
   cwrite $02, [$05 + 0x0b0], 0x8
   cwrite $03, [$05 + 0x0b1], 0x8
   cwrite $04, [$05 + 0x0b2], 0x8

Unlike normal GPU registers, writing control registers seems to always take
effect immediately; if writing a control register triggers some complex
operation that the firmware needs to wait for, then it typically uses a
spinloop with another control register to wait for it to finish.

Control registers are documented in ``adreno_control_regs.xml``. The
disassembler will try to recognize an immediate address as a known control
register and print it. For example, this sequence is similar to the one above
but on a6xx::

   and $05, $12, 0x0003
   shl $05, $05, 0x0002
   cwrite $0e, [$05 + @IB1_BASE], 0x0
   cwrite $0b, [$05 + @IB1_BASE+0x1], 0x0
   cwrite $04, [$05 + @IB1_DWORDS], 0x0

.. _afuc-sqe-regs:

SQE Registers
=============

Starting with a6xx, the state of the SQE processor itself can be accessed
through ``sread``/``swrite`` instructions that work identically to
``cread``/``cwrite``. For example, this includes the state of the
``call``/``ret`` stack. This is mainly used during the preemption routine but
it's also used to set the entrypoint for preemption.

.. _afuc-read:

Reading Memory and Registers
============================

The CP accesses memory directly with no caching. This means that except for
very small amounts of data accessed rarely, ``load`` and ``store`` are very
slow. Instead, ME/PFP and later SQE read memory through various queues. Reading
registers also uses a queue, likely because burst reading several registers at
once is faster than reading them one-by-one and reading does not complete
immediately. Queueing up a read involves writing an (address, length) pair to a
control register, and data is read from the queue using one of three special
registers:

- ``$data`` reads the next PM4 packet word. This comes from the RB, IB1, IB2,
  or SDS (Set Draw State) queue, controlled by ``@IB_LEVEL``. It also
  decrements ``$rem`` if it isn't already decremented by a ``(rep)`` prefix.
- ``$memdata`` reads the next word from a memory read buffer (MRB) set up by
  writing ``@MEM_READ_ADDR``/``@MEM_READ_DWORDS``. It's used by things like
  ``CP_MEMCPY`` and reading indirect draw parameters in ``CP_DRAW_INDIRECT``.
- ``$regdata`` reads from a register read buffer (RRB) set up by
  ``@REG_READ_ADDR``/``@REG_READ_DWORDS``.

RB, IB1, IB2, SDS, and MRB make up the Read-Only Queue or ROQ, in addition to
the Visibility Stream Decoder (VSD), which is set up via a similar control
register pair but is read by a fixed-function parser that the CP accesses via a
few control registers.

.. _afuc-reg-writes:

Writing Registers
=================

The same special registers, when used as a destination, can be used to
write GPU registers on ME. Because they have a totally different function when
used as a destination, they use different names:

- ``$addr`` sets the address and disables ``CP_PROTECT`` address checking.
- ``$usraddr`` sets the address and checks it against the ``CP_PROTECT`` access
  table. It's used for addresses specified by the PM4 packet stream instead of
  internally.
- ``$data`` writes the register and auto-increments the address.

For example, to write ``CP_SCRATCH_REG[0x2]`` and ``CP_SCRATCH_REG[0x3]``::

   mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
   mov $data, $03 ; CP_SCRATCH_REG[0x2]
   mov $data, $04 ; CP_SCRATCH_REG[0x3]
   ...

Subsequent writes to ``$data`` will increment the address of the register
to write, so a sequence of consecutive registers can be written. On a5xx ME,
this will directly write the register; on a6xx SQE this will instead determine
which cluster(s) the register belongs to and push the write onto the
appropriate per-cluster queue(s), letting the SQE run ahead of the GPU.

When bit 18 of ``$addr`` is set, the auto-incrementing is disabled. This is
often used with :ref:`NRT_DATA <afuc-mem-writes>`.
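
The ``$addr``/``$data`` behaviour described above can be modelled with a few
lines of Python. This is only a sketch: the class name ``RegWritePort`` is
hypothetical, and treating the bits below bit 18 as the register offset is an
assumption made to keep the example small:

.. code-block:: python

   NO_INCREMENT = 1 << 18  # bit 18 of $addr disables auto-increment

   class RegWritePort:
       """Toy model of the ``$addr``/``$data`` register write port."""

       def __init__(self):
           self.addr = 0
           self.writes = []  # (register offset, value) pairs, in order

       def write_addr(self, value):
           self.addr = value

       def write_data(self, value):
           # Assumption: the register offset lives in the bits below bit 18.
           self.writes.append((self.addr & (NO_INCREMENT - 1), value))
           if not self.addr & NO_INCREMENT:
               self.addr += 1

   port = RegWritePort()
   port.write_addr(0x100)
   port.write_data(1)   # goes to offset 0x100
   port.write_data(2)   # goes to offset 0x101

   fixed = RegWritePort()
   fixed.write_addr(0x200 | NO_INCREMENT)
   fixed.write_data(7)  # goes to offset 0x200
   fixed.write_data(8)  # still offset 0x200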

On a5xx ME, ``$regdata`` can also be used to directly read a register::

   mov $addr, CP_SCRATCH_REG[0x2]
   mov $03, $regdata
   mov $04, $regdata

This does not exist on a6xx because register reads are not synchronized against
writes any more.

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. On a5xx, these banks
are arranged so bit 11 is zero for bank 0 and 1 for bank 1. The ME fw (at
least the version I'm looking at) stores this in ``$17``, so to update these
registers from ME::

   or $addr, $17, VFD_INDEX_OFFSET
   mov $data, $03
   ...

On a6xx this is handled transparently to the SQE, and the bank to use is stored
separately in the cluster queue.

Registers can also be written directly, skipping the queue, by writing
``@REG_WRITE_ADDR``/``@REG_WRITE``. This is used on a6xx for certain frontend
registers that have their own queues and on a5xx is used by the PFP::

   mov $0c, CP_SCRATCH_REG[0x7]
   mov $02, 0x789a ; value
   cwrite $0c, [$00 + @REG_WRITE_ADDR], 0x8
   cwrite $02, [$00 + @REG_WRITE], 0x8

As with the ``$addr``/``$data`` approach, the destination register address
increments on each write to ``@REG_WRITE``.

.. _afuc-pipe-regs:

Pipe Registers
--------------

This is yet another private register space, triggered by writing to the high 8
bits of ``$addr`` and then writing ``$data`` normally. Some pipe registers like
``WAIT_MEM_WRITES`` or ``WAIT_GPU_IDLE`` have no data and a write is triggered
immediately when ``$addr`` is written, for example in ``CP_WAIT_MEM_WRITES``::

   mov $addr, 0x0084 << 24 ; |WAIT_MEM_WRITES

The pipe register is decoded here by the disassembler in a comment.

The main differences between pipe registers and control registers are:

- They are always write-only.
- On a6xx they are pipelined together with normal register writes, on a5xx they
  are written from ME like normal registers.
- Writing them can take an arbitrary amount of time, so they can be used to
  wait for some condition without spinning.

In short, they behave more like normal registers but are not expected to be
read/written by anything other than CP. Over time more and more GPU registers
not touched by the kernel driver have been converted to pipe registers.

.. _afuc-mem-writes:

Writing Memory
==============

Writing memory is done by writing GPU registers:

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``.

The address register increments with successive writes.

On a5xx, this seems to be only used by ME. If PFP were also using it, they would
race with each other. It can also be used for reads, primarily small reads.

Memory Write example::

   ; store 64b value in $04+$05 to 64b address in $02+$03
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $addr, CP_ME_NRT_DATA
   mov $data, $04
   mov $data, $05

Memory Read example::

   ; load 64b value from address in $02+$03 into $04+$05
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $04, $addr
   mov $05, $addr

On a6xx ``CP_ME_NRT_ADDR`` and ``CP_ME_NRT_DATA`` have been replaced by
:ref:`pipe registers <afuc-pipe-regs>` and they can only be used for writes, but
it otherwise works similarly.

Load and Store Instructions
===========================

a6xx adds ``load`` and ``store`` instructions that work similarly to ``cread``
and ``cwrite``. Because the address is 64 bits but registers are 32-bit, the
high 32 bits come from the ``@LOAD_STORE_HI``
:ref:`control register <afuc-control>`. They are mostly used by the context
switch routine and even then very sparingly, before the memory read/write queue
state is saved and while it is being restored.

Modifiers
=========

There are several modifiers that enable more compact and efficient
implementations of common patterns:

.. _afuc-rep:

Repeat
------

``(rep)`` repeats the same instruction ``$rem`` times. More precisely, it
decrements ``$rem`` after the instruction executes if it wasn't already
decremented from a read from ``$data`` and re-executes the instruction until
``$rem`` is 0. It can be used with ALU instructions and control instructions.
Usually it is used in conjunction with ``$data`` to read the rest of the packet
in one instruction, but it can also be used freestanding. For example, this
snippet clears the control register scratch space::

   mov $rem, 0x0080 ; clear 0x80 registers
   mov $03, 0x00ff ; start at 0xff + 1 = 0x100
   (rep)cwrite $00, [$03 + 0x001], 0x4

Note the use of pre-increment mode, so that the first execution clears
``0x100`` and updates ``$03`` to ``0x100``, the second execution clears
``0x101`` and updates ``$03`` to ``0x101``, and so on.
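
The effect of that snippet can be checked with a small Python model. It is a
sketch under stated assumptions: control registers and general registers are
plain dictionaries, the helper name ``rep_cwrite_preinc`` is hypothetical, and
the loop performs no writes at all when ``$rem`` starts at 0, which the text
above does not spell out:

.. code-block:: python

   def rep_cwrite_preinc(ctrl, regs, src, off, addr, rem):
       """Model ``(rep)cwrite $src, [$off + addr]`` in pre-increment mode."""
       while rem > 0:
           regs[off] += addr            # pre-increment: $off = $off + addr
           ctrl[regs[off]] = regs[src]  # write the control register
           rem -= 1                     # (rep) decrements $rem each iteration
       return rem

   # Scratch space 0x100..0x17f filled with a recognisable junk value.
   ctrl = {a: 0xDEAD for a in range(0x100, 0x180)}
   regs = {"$00": 0, "$03": 0x00FF}
   rem = rep_cwrite_preinc(ctrl, regs, "$00", "$03", 0x001, 0x0080)

After the call every scratch register from ``0x100`` to ``0x17f`` holds 0,
``$03`` ends at ``0x17f``, and nothing past the scratch space was touched.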

.. _afuc-xmov:

eXtra Moves
-----------

``(xmovN)`` is an optimization which lets the firmware read multiple words from
a queue in the same cycle. Conceptually, it adds "extra" mov instructions to be
executed after a given ALU instruction, although in practice they are likely
executed in parallel. ``(xmov1)`` adds up to 1 move, ``(xmov2)`` adds up to 2,
and ``(xmov3)`` adds up to 3. The actual number of moves added is the minimum
of the number in the instruction and ``$rem``, so a ``(xmov3)`` instruction
behaves like a ``(xmov1)`` instruction if ``$rem = 1``. Given an instruction::

   (xmovN) alu $dst, $src1, $src2

or a 1-source instruction::

   (xmovN) alu $dst, $src2

then we compute the number of extra moves ``M = min(N, $rem)``. If ``M = 1``,
then we add::

   mov $data, $src2

If ``M = 2``, then we add::

   mov $data, $src2
   mov $data, $src2

Finally, as a special case explained below, if ``M = 3`` then we add::

   mov $data, $src2
   mov $dst, $src2 ; !!!
   mov $data, $src2

If ``$dst`` is not one of the "special" registers ``$data``, ``$addr``,
``$usraddr``, then ``$data`` is replaced by ``$00`` in all destinations, i.e.
the results of the subsequent moves are discarded.
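
The expansion rules above are compact enough to write down as a function. The
following Python sketch returns the extra moves as assembly strings; the
function name ``xmov_extra_moves`` is hypothetical, and ``rem`` stands for
whatever value of ``$rem`` the hardware uses when it computes ``M``, which is
an assumption since the exact sampling point is not described:

.. code-block:: python

   SPECIAL_DSTS = {"$data", "$addr", "$usraddr"}

   def xmov_extra_moves(n, rem, dst, src2):
       """Return the moves ``(xmovN)`` adds after ``alu dst, ..., src2``."""
       m = min(n, rem)
       # Non-special destinations turn the extra ``$data`` writes into ``$00``.
       data = "$data" if dst in SPECIAL_DSTS else "$00"
       if m == 1:
           dsts = [data]
       elif m == 2:
           dsts = [data, data]
       elif m == 3:
           dsts = [data, dst, data]  # the M = 3 special case
       else:
           dsts = []
       return [f"mov {d}, {src2}" for d in dsts]

   bunch = xmov_extra_moves(3, 4, "$usraddr", "$data")
   short = xmov_extra_moves(3, 1, "$usraddr", "$data")

Here ``bunch`` is the three-move ``M = 3`` expansion for a ``$usraddr``
destination, and ``short`` shows the same modifier degrading to a single extra
move when only one word remains.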

The purpose of the ``M = 3`` special case is mostly to efficiently implement
``CP_CONTEXT_REG_BUNCH``. This is the entire implementation of
``CP_CONTEXT_REG_BUNCH``, which is essentially just one instruction::

   CP_CONTEXT_REG_BUNCH:
   (rep)(xmov3)mov $usraddr, $data
   waitin
   mov $01, $data

If there are 4 or more words remaining in the packet, that is if there are at
least two more registers to write, then (ignoring the ``(rep)`` for a moment)
the instruction expands to::

   mov $usraddr, $data
   mov $data, $data
   mov $usraddr, $data
   mov $data, $data

This is likely all executed in a single cycle, allowing us to write 2 registers
per cycle.

``(xmov1)`` can also be added to ``(rep)mov $data, $data``, which is a common
pattern to write the rest of the packet to successive registers, to write up to
2 registers per cycle as well. The firmware does not use ``(xmov3)`` there,
however, so 2 registers per cycle is likely a hardware limitation.

Although ``(xmovN)`` is often used in combination with ``(rep)``, it doesn't
have to be. For example, ``(xmov1)mov $data, $data`` moves the next 2 packet
words to 2 successive registers.

.. _afuc-sds:

Set Draw State
--------------

``(sdsN)`` is a modifier for ``cwrite`` used to accelerate
``CP_SET_DRAW_STATE``. For each draw state group to update,
``CP_SET_DRAW_STATE`` needs to copy 3 words from the packet containing the
group to update, metadata, and base address plus size. Using the ``(sds2)``
modifier as well as ``(rep)``, this can be accomplished in a single
instruction::

   (rep)(sds2)cwrite $data, [$00 + @DRAW_STATE_SET_HDR]

The first word containing the header is written to ``@DRAW_STATE_SET_HDR``, and
the second and third words containing the draw state base come from reading the
source again twice and are written directly to the draw state RAM.

In testing with other control registers, ``(sdsN)`` causes the source to be
read ``N`` extra times and then thrown away. Only when used in combination with
``@DRAW_STATE_SET_HDR`` do the extra source reads have an effect.

.. _afuc-peek:

Peek
----

``(peek)`` is valid on ALU instructions without an immediate. It modifies what
``$data`` (and possibly ``$memdata`` and ``$regdata``) do by making them avoid
consuming the word. The next read of ``$data`` will return the same thing. This
is used solely by ``CP_INDIRECT_BUFFER`` to test if there is a subsequent IB
that can be prefetched while the first IB is executed, without actually
consuming the header for the next packet. It was introduced on a7xx, and
replaces the use of a special control register.
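
The difference between a normal read and a peeked read can be pictured as a
queue with a non-consuming accessor. The Python sketch below is only an
analogy: the ``PacketWords`` class is hypothetical, the word values are
arbitrary, and it does not model ``$rem``:

.. code-block:: python

   from collections import deque

   class PacketWords:
       """Toy stand-in for the queue behind ``$data``."""

       def __init__(self, words):
           self._words = deque(words)

       def read(self, peek=False):
           if peek:
               return self._words[0]      # (peek): look, but do not consume
           return self._words.popleft()   # normal read consumes the word

   queue = PacketWords([0x11, 0x22])
   first_look = queue.read(peek=True)
   first_read = queue.read()
   second_read = queue.read()

The peeked value and the following normal read return the same word; only the
normal read advances the queue.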

Packet Table
============

The core of the microprocessor's job is to parse each packet header and jump to
its handler. This is done through a ``waitin`` instruction which waits for the
packet header to become available and then parses the header and jumps to the
handler using a jump table. However it does *not* actually consume the header.
Like any branch instruction, it has a delay slot, and by convention this delay
slot always contains a ``mov $01, $data`` instruction. This consumes the same
header that ``waitin`` parsed and puts it in ``$01`` so that the packet header
is available in ``$01`` to the next packet's handler. Thus all packet handlers
end with this sequence::

   waitin
   mov $01, $data

The jump table itself is initialized by the SQE in the bootstrap routine at the
beginning of the firmware. Amongst other tasks, it reads the offset of the jump
table from the NOP payload at the beginning, then uses a jump table embedded at
the end of the firmware to set it up by writing the ``@PACKET_TABLE_WRITE``
control register. After everything is set up, it does the ``waitin`` sequence
to start handling the first packet (which should be ``CP_ME_INIT``).

Example Packet
==============

Let's examine an implementation of ``CP_MEM_WRITE``::

   CP_MEM_WRITE:
   mov $addr, 0x00a0 << 24 ; |NRT_ADDR

First, we set up the register to write to, which is the ``NRT_ADDR``
:ref:`pipe register <afuc-pipe-regs>`. It turns out that the low 2 bits of
``NRT_ADDR`` are a flag which when 1 disables auto-incrementing ``NRT_ADDR``
when ``NRT_DATA`` is written, but we don't want this behavior so we have to
make sure they are clear::

   or $02, $data, 0x0003 ; reading $data reads the next PM4 word
   xor $data, $02, 0x0003 ; writing $data writes the register, which is NRT_ADDR
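
The ``or``/``xor`` pair is a two-instruction way of clearing the low 2 bits,
presumably chosen because the equivalent ``and`` mask would not fit in a 16-bit
immediate. A short Python check of the identity (the helper name
``clear_low2`` is hypothetical):

.. code-block:: python

   def clear_low2(word):
       """Mirror the ``or``/``xor`` pair used to clear the low 2 bits."""
       tmp = word | 0x0003   # or  $02, $data, 0x0003: force both bits to 1
       return tmp ^ 0x0003   # xor $data, $02, 0x0003: flip them back to 0

   aligned = clear_low2(0x12345677)

Setting the bits first guarantees that the ``xor`` always clears them, so the
result equals ``word & ~0x3`` for any input.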

Writing ``$data`` auto-increments ``$addr``, so now the next write is to
``0xa1`` or ``NRT_ADDR+1`` (``NRT_ADDR`` is a 64-bit register)::

   mov $data, $data

Now, we have to write ``NRT_DATA``. We want to repeatedly write the same
register, without having to fight the auto-increment by resetting ``$addr``
each time, which is where the bit 18 that disables auto-increment comes in
handy::

   mov $addr, 0xa204 << 16 ; |NRT_DATA
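
To make the immediate easier to read, the Python sketch below splits an
``$addr`` value into the two fields this document describes: the pipe register
number in the high 8 bits and the no-increment flag in bit 18. The helper name
``decode_addr`` and the returned dictionary layout are hypothetical:

.. code-block:: python

   def decode_addr(value):
       """Split an ``$addr`` value into the fields described in this document."""
       return {
           "pipe_reg": (value >> 24) & 0xFF,          # high 8 bits
           "no_increment": bool(value & (1 << 18)),   # bit 18
       }

   nrt_addr = decode_addr(0x00A0 << 24)   # first instruction of CP_MEM_WRITE
   nrt_data = decode_addr(0xA204 << 16)   # the instruction above

``0xa204 << 16`` therefore selects pipe register ``0xa2`` (two past
``NRT_ADDR`` at ``0xa0``) with auto-increment disabled.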

Finally, we have to repeatedly copy the remaining PM4 packet data to the
``NRT_DATA`` register, which we can do in one instruction with
:ref:`(rep) <afuc-rep>`. Furthermore we can use :ref:`(xmov1) <afuc-xmov>` to
squeeze out some more performance::

   (rep)(xmov1)mov $data, $data

At the end is the standard go-to-next-packet sequence::

   waitin
   mov $01, $data

Reassembling Firmwares
======================

Of course, the main use of assembling is to take the firmware you're using,
modify it to test something, and reassemble it. Reassembling a firmware should
work out-of-the-box, and should give you back an identical firmware, but there
is a caveat if you want to reassemble a modified firmware and use preemption.
The preemption routines contain a few tables embedded in the firmware, and they
load the offset of the table with a ``mov`` instruction that needs to be turned
into a relocation and then add it to ``CP_SQE_INSTR_BASE``. ``afuc-asm``
supports using labels as immediates for this::

   foo:
   [00000000]
   ...

   mov $02, #foo << 2 ; #foo will be replaced with the offset in words

However, you have to manually insert the labels and replace the constant. On
a7xx there are multiple tables next to each other that look like one table, so
be careful to make sure you've found all the places it offsets from
``CP_SQE_INSTR_BASE``! There are also tables in the BV microcode on a7xx. To
check that the relocations are correct, check that reassembling an otherwise
unmodified firmware still gives an identical result after adding the
relocations.

A6XX NOTES
==========

The ``$14`` register holds global flags set by::

   CP_SKIP_IB2_ENABLE_LOCAL - b8
   CP_SKIP_IB2_ENABLE_GLOBAL - b9
   CP_SET_MARKER
     MODE=GMEM - sets b15
     MODE=BLIT2D - clears b15, b12, b7
   CP_SET_MODE - b29+b30
   CP_SET_VISIBILITY_OVERRIDE - b11, b21, b30?
   CP_SET_DRAW_STATE - checks b29+b30

   CP_COND_REG_EXEC - checks b10, which should be predicate flag?