=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two micro-controllers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (basically glorified
register writers) and basically all their state is in a collection
of registers. I.e. there is no stack, and no memory assigned to
them; any global state like which bank of context registers is to
be used in the next draw is stored in a register.

The setup is similar to radeon, in fact Adreno 2xx through 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and "ME" (Micro Engine, also confusingly referred
to as "PM4"). These make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remain the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it internally).

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

Starting with Adreno 660, another processor called LPAC (Low Priority
Asynchronous Compute) is introduced which is a slightly cut-down copy of the
SQE used to execute background compute tasks. Unlike on 5xx, the firmware is
bundled together with the main SQE firmware, and the SQE is responsible for
booting LPAC. On 7xx, to implement concurrent binning the SQE is split into two
processors called BR and BV.
Again, the firmware for all three is bundled
together and BR is responsible for booting both BV and LPAC.

.. _afuc-overview:

Instruction Set Overview
========================

The afuc instruction set is heavily inspired by MIPS, but not exactly
compatible.

Registers
=========

Similar to MIPS, there are 32 registers, and some are special purpose. ``$00``
is the same as ``$zero`` on MIPS, it reads 0 and writes are discarded.

Registers are displayed in the current disassembly with a hexadecimal
numbering, e.g. ``$0a`` is encoded as 10.

The ABI used when processing packets is that ``$01`` contains the current PM4
header, registers from ``$02`` up to ``$11`` are temporaries and may be freely
clobbered by the packet handler, while ``$12`` and above are used to store
global state like the IB level and next visible draw (for draw skipping).

Unlike in MIPS, there is a special small hardware-managed stack and special
instructions ``call``/``ret`` which use it. The stack only contains return
addresses, there is no "stack frame" to spill values to. As a result, ``$sp``,
``$fp``, and ``$ra`` don't exist as on MIPS. Instead the last 3 registers are
used to :ref:`read <afuc-read>` from various queues and
:ref:`write GPU registers <afuc-reg-writes>`. In addition there is a ``$rem``
register which normally contains the number of words remaining in the packet
but can also be used as a normal register in combination with the rep prefix.

.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add`` - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub`` - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and`` - bitwise AND
- ``or`` - bitwise OR
- ``xor`` - bitwise XOR
- ``not`` - bitwise NOT (no src1)
- ``shl`` - shift-left
- ``ushr`` - unsigned shift-right
- ``ishr`` - signed shift-right
- ``rot`` - rotate-left (like shift-left with wrap-around)
- ``mul8`` - multiply low 8b of two src
- ``min`` - minimum
- ``max`` - maximum
- ``cmp`` - compare two values

Similar to MIPS, the ALU instructions can take either two src registers, or a
src plus 16b immediate as 2nd src, e.g.::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg

The ``not`` instruction only takes a single source::

  not $dst, $src
  not $dst, 0x1234

One departure from MIPS is that there is a special immediate-form ``mov``
instruction that can shift the 16-bit immediate by a given amount::

  mov $dst, 0x1234 << 2

This replaces ``lui`` on MIPS (just use a shift of 16) while also allowing the
quick construction of small bitfields, which comes in handy in various places.

.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See explanation in :ref:`afuc-branch`.


.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

  breq $src, b3, #somelabel  ; branch if src & (1 << 3)
  breq $src, 0x3, #somelabel ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns, for example::

  cmp $04, $02, $03
  breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``, because bit 1 is set
in both ``0x2b`` and ``0x1e`` but not in ``0x00``.

Delay slots
-----------

Branch instructions have a delay slot so the following instruction is always
executed regardless of whether the branch is taken or not. Unlike MIPS, a
branch in the delay slot is legal as long as the original branch and the branch
in its delay slot are never both taken. Because jump tables are awkward and
slow due to the lack of memory caching, this is often exploited to create dense
sequences of branches to implement switch-case constructs::

  breq $02, 0x1, #foo
  breq $02, 0x2, #bar
  breq $02, 0x3, #baz
  ...
  nop
  jump #default

Another common use of a branch in a delay slot is a double-jump (jump to one
location if a condition is true, and another location if false).
In MIPS this
requires two delay slots::

  beq $t0, 0x1, #foo
  nop ; beq delay slot
  b #bar
  nop ; b delay slot

In afuc this only requires a delay slot for the second branch::

  breq $02, 0x1, #foo
  brne $02, 0x1, #bar
  nop

Note that for the second branch we had to use a conditional branch with the
opposite condition instead of an unconditional branch as in the MIPS example,
to guarantee that at most one is ever taken.

.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset from the SQE instruction base.

  TODO not sure how many levels deep function calls can be nested.
  There isn't really a stack. Definitely seems to be multiple
  levels of fxn call, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
  f22.

.. _afuc-nop:

NOPs
====

Afuc has a special NOP encoding where the low 24 bits are ignored by the
processor. On a5xx the high 8 bits are ``00``, on a6xx they are ``01``
(probably to make sure that 0 is not a legal instruction, increasing the
chances of halting immediately when something is misconfigured). This is used
sometimes to create a "payload" that is ignored when executed. For example, the
first 2 instructions of the firmware typically contain the firmware ID and
version followed by the packet handling table offset encoded as NOPs. They are
skipped when executed but they are later read as data by the bootstrap routine.

.. _afuc-control:

Control Registers
=================

Control registers are a special register space that can only be read/written
directly by CP through ``cread``/``cwrite`` instructions:

- ``cread $dst, [$off + addr], flags``
- ``cread $dst, [$off + addr]!, flags``
- ``cwrite $src, [$off + addr], flags``
- ``cwrite $src, [$off + addr]!, flags``

Control registers ``0x000`` to ``0x0ff`` are private registers used to control
the CP, for example to indicate where to read from memory or (normal)
registers. ``0x100`` to ``0x17f`` are a private scratch space used by the
firmware however it wants, for example as an ad-hoc stack to spill registers
when calling a function or to store the scratch used in ``CP_SCRATCH_TO_*``
packets. Starting with the introduction of LPAC, ``0x200`` to ``0x27f`` are a
shared scratch space used to communicate between processors and on a7xx they
can also be written on event completion to implement so-called "on-chip
timestamps".

In cases where no offset is needed, ``$00`` is frequently used as the offset.

The addressing mode with ``!`` is a pre-increment mode that writes the final
address ``$off + addr`` to ``$off``.

For example, the following sequence sets up the ``CP_IBn_BASE_LO``/``HI``/``BUFSIZE``
control registers for an indirect buffer::

  ; load CP_INDIRECT_BUFFER parameters from cmdstream:
  mov $02, $data ; low 32b of IB target address
  mov $03, $data ; high 32b of IB target
  mov $04, $data ; IB size in dwords

  ; sanity check # of dwords:
  breq $04, 0x0, #l23

  ; this seems something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (ie. so the
  ; below cwrite instructions update either
  ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
  and $05, $18, 0x0003
  shl $05, $05, 0x0002

  ; update CP_IBn_BASE_LO/HI/BUFSIZE:
  cwrite $02, [$05 + 0x0b0], 0x8
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8

Unlike normal GPU registers, writing control registers seems to always take
effect immediately; if writing a control register triggers some complex
operation that the firmware needs to wait for, then it typically uses a
spinloop with another control register to wait for it to finish.

Control registers are documented in ``adreno_control_regs.xml``. The
disassembler will try to recognize an immediate address as a known control
register and print it, for example this sequence, similar to the one above
but on a6xx::

  and $05, $12, 0x0003
  shl $05, $05, 0x0002
  cwrite $0e, [$05 + @IB1_BASE], 0x0
  cwrite $0b, [$05 + @IB1_BASE+0x1], 0x0
  cwrite $04, [$05 + @IB1_DWORDS], 0x0

.. _afuc-sqe-regs:

SQE Registers
=============

Starting with a6xx, the state of the SQE processor itself can be accessed
through ``sread``/``swrite`` instructions that work identically to
``cread``/``cwrite``. For example, this includes the state of the
``call``/``ret`` stack. This is mainly used during the preemption routine but
it's also used to set the entrypoint for preemption.

.. _afuc-read:

Reading Memory and Registers
============================

The CP accesses memory directly with no caching. This means that except for
very small amounts of data accessed rarely, ``load`` and ``store`` are very
slow. Instead, ME/PFP and later SQE read memory through various queues. Reading
registers also uses a queue, likely because burst reading several registers at
once is faster than reading them one-by-one and reading does not complete
immediately.
Queueing up a read involves writing an (address, length) pair to a
control register, and data is read from the queue using one of three special
registers:

- ``$data`` reads the next PM4 packet word. This comes from the RB, IB1, IB2,
  or SDS (Set Draw State) queue, controlled by ``@IB_LEVEL``. It also
  decrements ``$rem`` if it isn't already decremented by a rep prefix.
- ``$memdata`` reads the next word from a memory read buffer (MRB) set up by
  writing ``@MEM_READ_ADDR``/``@MEM_READ_DWORDS``. It's used by things like
  ``CP_MEMCPY`` and reading indirect draw parameters in ``CP_DRAW_INDIRECT``.
- ``$regdata`` reads from a register read buffer (RRB) set up by
  ``@REG_READ_ADDR``/``@REG_READ_DWORDS``.

RB, IB1, IB2, SDS, and MRB make up the Read-Only Queue or ROQ, in addition to
the Visibility Stream Decoder (VSD) which is set up via a similar control
register pair but is read by a fixed-function parser that the CP accesses via a
few control registers.

.. _afuc-reg-writes:

Writing Registers
=================

The same special registers, when used as a destination, can be used to
write GPU registers on ME. Because they have a totally different function when
used as a destination, they use different names:

- ``$addr`` sets the address and disables ``CP_PROTECT`` address checking.
- ``$usraddr`` sets the address and checks it against the ``CP_PROTECT`` access
  table. It's used for addresses specified by the PM4 packet stream instead of
  internally.
- ``$data`` writes the register and auto-increments the address.

For example, to write two consecutive registers::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
  mov $data, $04                 ; CP_SCRATCH_REG[0x3]
  ...

Subsequent writes to ``$data`` increment the address of the register
to write, so a sequence of consecutive registers can be written.
On a5xx ME,
this will directly write the register, on a6xx SQE this will instead determine
which cluster(s) the register belongs to and push the write onto the
appropriate per-cluster queue(s) letting the SQE run ahead of the GPU.

When bit 18 of ``$addr`` is set, the auto-incrementing is disabled. This is
often used with :ref:`NRT_DATA <afuc-mem-writes>`.

On a5xx ME, ``$regdata`` can also be used to directly read a register::

  mov $addr, CP_SCRATCH_REG[0x2]
  mov $03, $regdata
  mov $04, $regdata

This does not exist on a6xx because register reads are not synchronized against
writes any more.

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. On a5xx, these banks
are arranged so bit 11 is zero for bank 0 and 1 for bank 1. The ME fw (at
least the version I'm looking at) stores this in ``$17``, so to update these
registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET
  mov $data, $03
  ...

On a6xx this is handled transparently to the SQE, and the bank to use is stored
separately in the cluster queue.

Registers can also be written directly, skipping the queue, by writing
``@REG_WRITE_ADDR``/``@REG_WRITE``. This is used on a6xx for certain frontend
registers that have their own queues and on a5xx is used by the PFP::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a ; value
  cwrite $0c, [$00 + @REG_WRITE_ADDR], 0x8
  cwrite $02, [$00 + @REG_WRITE], 0x8

Like with the ``$addr``/``$data`` approach, the destination register address
increments on each write to ``@REG_WRITE``.

.. _afuc-pipe-regs:

Pipe Registers
--------------

This is yet another private register space, triggered by writing to the high 8
bits of ``$addr`` and then writing ``$data`` normally.
Some pipe registers like
``WAIT_MEM_WRITES`` or ``WAIT_GPU_IDLE`` have no data and a write is triggered
immediately when ``$addr`` is written, for example in ``CP_WAIT_MEM_WRITES``::

  mov $addr, 0x0084 << 24 ; |WAIT_MEM_WRITES

The pipe register is decoded here by the disassembler in a comment.

The main differences between pipe registers and control registers are:

- They are always write-only.
- On a6xx they are pipelined together with normal register writes, on a5xx they
  are written from ME like normal registers.
- Writing them can take an arbitrary amount of time, so they can be used to
  wait for some condition without spinning.

In short, they behave more like normal registers but are not expected to be
read/written by anything other than CP. Over time more and more GPU registers
not touched by the kernel driver have been converted to pipe registers.

.. _afuc-mem-writes:

Writing Memory
==============

Writing memory is done by writing GPU registers:

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``.

The address register increments with successive writes.

On a5xx, this seems to be only used by ME. If PFP were also using it, they would
race with each other. It can also be used for reads, primarily small reads.
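
The address/data register pair described above can be modelled in a few lines
of Python. This is only a sketch of the documented behaviour, not of the
hardware: the class and method names are made up for illustration, and the
4-byte step per write is an assumption (the text only says that the address
increments with successive writes).

```python
class NrtWriter:
    """Toy model of CP_ME_NRT_ADDR / CP_ME_NRT_DATA (hypothetical names)."""

    def __init__(self):
        self.addr = 0      # 64-bit target address
        self.memory = {}   # address -> 32-bit word, stands in for GPU memory

    def set_addr(self, lo, hi):
        # Corresponds to writing CP_ME_NRT_ADDR_LO followed by _HI.
        self.addr = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)

    def write_data(self, value):
        # Corresponds to writing CP_ME_NRT_DATA: store, then advance.
        self.memory[self.addr] = value & 0xFFFFFFFF
        self.addr += 4     # assumption: one 32-bit word per write


nrt = NrtWriter()
nrt.set_addr(0x1000, 0x0)
nrt.write_data(0xDEADBEEF)
nrt.write_data(0x12345678)
```

Two successive data writes land in two consecutive words, which is why a
64-bit store needs only one address setup.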

Memory Write example::

  ; store 64b value in $04+$05 to 64b address in $02+$03
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $addr, CP_ME_NRT_DATA
  mov $data, $04
  mov $data, $05

Memory Read example::

  ; load 64b value from address in $02+$03 into $04+$05
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $04, $addr
  mov $05, $addr

On a6xx ``CP_ME_NRT_ADDR`` and ``CP_ME_NRT_DATA`` have been replaced by
:ref:`pipe registers <afuc-pipe-regs>` and they can only be used for writes but
it otherwise works similarly.

Load and Store Instructions
===========================

a6xx adds ``load`` and ``store`` instructions that work similarly to ``cread``
and ``cwrite``. Because the address is 64-bits but registers are 32-bit, the
high 32 bits come from the ``@LOAD_STORE_HI``
:ref:`control register <afuc-control>`. They are mostly used by the context
switch routine and even then very sparingly, before the memory read/write queue
state is saved or while it is being restored.

Modifiers
=========

There are several modifiers that enable more compact and efficient
implementations of common patterns:

.. _afuc-rep:

Repeat
------

``(rep)`` repeats the same instruction ``$rem`` times. More precisely, it
decrements ``$rem`` after the instruction executes if it wasn't already
decremented from a read from ``$data`` and re-executes the instruction until
``$rem`` is 0. It can be used with ALU instructions and control instructions.
Usually it is used in conjunction with ``$data`` to read the rest of the packet
in one instruction, but it can also be used freestanding, for example this
snippet clears the control register scratch space::

  mov $rem, 0x0080 ; clear 0x80 registers
  mov $03, 0x00ff ; start at 0xff + 1 = 0x100
  (rep)cwrite $00, [$03 + 0x001], 0x4

Note the use of pre-increment mode, so that the first execution clears
``0x100`` and updates ``$03`` to ``0x100``, the second execution clears
``0x101`` and updates ``$03`` to ``0x101``, and so on.

.. _afuc-xmov:

eXtra Moves
-----------

``(xmovN)`` is an optimization which lets the firmware read multiple words from
a queue in the same cycle. Conceptually, it adds "extra" mov instructions to be
executed after a given ALU instruction, although in practice they are likely
executed in parallel. ``(xmov1)`` adds up to 1 move, ``(xmov2)`` adds up to 2,
and ``(xmov3)`` adds up to 3. The actual number of moves added is the minimum
of the number in the instruction and ``$rem``, so a ``(xmov3)`` instruction
behaves like a ``(xmov1)`` instruction if ``$rem = 1``. Given an instruction::

  (xmovN) alu $dst, $src1, $src2

or a 1-source instruction::

  (xmovN) alu $dst, $src2

then we compute the number of extra moves ``M = min(N, $rem)``. If ``M = 1``,
then we add::

  mov $data, $src2

If ``M = 2``, then we add::

  mov $data, $src2
  mov $data, $src2

Finally, as a special case explained below, if ``M = 3`` then we add::

  mov $data, $src2
  mov $dst, $src2 ; !!!
  mov $data, $src2

If ``$dst`` is not one of the "special" registers ``$data``, ``$addr``,
``$usraddr``, then ``$data`` is replaced by ``$00`` in all destinations, i.e.
the results of the subsequent moves are discarded.

The purpose of the ``M = 3`` special case is mostly to efficiently implement
``CP_CONTEXT_REG_BUNCH``.
This is the entire implementation of
``CP_CONTEXT_REG_BUNCH``, which is essentially just one instruction::

  CP_CONTEXT_REG_BUNCH:
  (rep)(xmov3)mov $usraddr, $data
  waitin
  mov $01, $data

If there are 4 or more words remaining in the packet, that is if there are at
least two more registers to write, then (ignoring the ``(rep)`` for a moment)
the instruction expands to::

  mov $usraddr, $data
  mov $data, $data
  mov $usraddr, $data
  mov $data, $data

This is likely all executed in a single cycle, allowing us to write 2 registers
per cycle.

``(xmov1)`` can also be added to ``(rep)mov $data, $data``, which is a common
pattern to write the rest of the packet to successive registers, to write up to
2 registers per cycle as well. The firmware does not use ``(xmov3)`` for this
pattern, however, so 2 registers per cycle is likely a hardware limitation.

Although ``(xmovN)`` is often used in combination with ``(rep)``, it doesn't
have to be. For example, ``(xmov1)mov $data, $data`` moves the next 2 packet
words to 2 successive registers.

.. _afuc-sds:

Set Draw State
--------------

``(sdsN)`` is a modifier for ``cwrite`` used to accelerate
``CP_SET_DRAW_STATE``. For each draw state group to update,
``CP_SET_DRAW_STATE`` needs to copy 3 words from the packet containing the
group to update, metadata, and base address plus size. Using the ``(sds2)``
modifier as well as ``(rep)``, this can be accomplished in a single
instruction::

  (rep)(sds2)cwrite $data, [$00 + @DRAW_STATE_SET_HDR]

The first word containing the header is written to ``@DRAW_STATE_SET_HDR``, and
the second and third words containing the draw state base come from reading the
source again twice and are written directly to the draw state RAM.

In testing with other control registers, ``(sdsN)`` causes the source to be
read ``N`` extra times and then thrown away.
Only when used in combination with
``@DRAW_STATE_SET_HDR`` do the extra source reads have an effect.

.. _afuc-peek:

Peek
----

``(peek)`` is valid on ALU instructions without an immediate. It modifies what
``$data`` (and possibly ``$memdata`` and ``$regdata``) do by making them avoid
consuming the word. The next read to ``$data`` will return the same thing. This
is used solely by ``CP_INDIRECT_BUFFER`` to test if there is a subsequent IB
that can be prefetched while the first IB is executed without actually
consuming the header for the next packet. It is introduced on a7xx, and
replaces the use of a special control register.

Packet Table
============

The core of the microprocessor's job is to parse each packet header and jump to
its handler. This is done through a ``waitin`` instruction which waits for the
packet header to become available and then parses the header and jumps to the
handler using a jump table. However it does *not* actually consume the header.
Like any branch instruction, it has a delay slot, and by convention this delay
slot always contains a ``mov $01, $data`` instruction. This consumes the same
header that ``waitin`` parsed and puts it in ``$01`` so that the packet header
is available in ``$01`` in the next packet. Thus all packet handlers end with
this sequence::

  waitin
  mov $01, $data

The jump table itself is initialized by the SQE in the bootstrap routine at the
beginning of the firmware. Amongst other tasks, it reads the offset of the jump
table from the NOP payload at the beginning, then uses a jump table embedded at
the end of the firmware to set it up by writing the ``@PACKET_TABLE_WRITE``
control register. After everything is set up, it does the ``waitin`` sequence
to start handling the first packet (which should be ``CP_ME_INIT``).
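
The ``waitin`` convention can be sketched in Python as a peek followed by a
pop. This only illustrates the control flow described above: the header layout
(opcode in the low byte), the opcodes, and the handler names are all invented
for the example and do not correspond to real PM4 packets.

```python
from collections import deque


def opcode_of(header):
    # Placeholder header layout (NOT the real PM4 encoding): opcode in low byte.
    return header & 0xFF


def nop_handler(queue, regs):
    # Hypothetical packet with no payload words.
    return "nop"


def one_word_handler(queue, regs):
    # Hypothetical packet with a single payload word, read like "mov $02, $data".
    regs["$02"] = queue.popleft()
    return "one_word"


def run(queue, packet_table, regs):
    """Dispatch packets until the queue is empty; return handler names in order."""
    trace = []
    while queue:
        header = queue[0]                          # waitin: parse, do not consume
        handler = packet_table[opcode_of(header)]  # look up the jump table entry
        regs["$01"] = queue.popleft()              # delay slot: mov $01, $data
        trace.append(handler(queue, regs))
    return trace


queue = deque([0x01, 0x02, 0xCAFE, 0x01])
regs = {}
trace = run(queue, {0x01: nop_handler, 0x02: one_word_handler}, regs)
```

After the run, ``trace`` lists the handlers in dispatch order and
``regs["$01"]`` holds the last header that was consumed.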

Example Packet
==============

Let's examine an implementation of ``CP_MEM_WRITE``::

  CP_MEM_WRITE:
  mov $addr, 0x00a0 << 24 ; |NRT_ADDR

First, we set up the register to write to, which is the ``NRT_ADDR``
:ref:`pipe register <afuc-pipe-regs>`. It turns out that the low 2 bits of
``NRT_ADDR`` are a flag which when 1 disables auto-incrementing ``NRT_ADDR``
when ``NRT_DATA`` is written, but we don't want this behavior so we have to
make sure they are clear::

  or $02, $data, 0x0003  ; reading $data reads the next PM4 word
  xor $data, $02, 0x0003 ; writing $data writes the register, which is NRT_ADDR

Writing ``$data`` auto-increments ``$addr``, so now the next write is to
``0xa1`` or ``NRT_ADDR+1`` (``NRT_ADDR`` is a 64-bit register)::

  mov $data, $data

Now, we have to write ``NRT_DATA``. We want to repeatedly write the same
register, without having to fight the auto-increment by resetting ``$addr``
each time, which is where the bit 18 that disables auto-increment comes in
handy::

  mov $addr, 0xa204 << 16 ; |NRT_DATA

Finally, we have to repeatedly copy the remaining PM4 packet data to the
``NRT_DATA`` register, which we can do in one instruction with
:ref:`(rep) <afuc-rep>`. Furthermore we can use :ref:`(xmov1) <afuc-xmov>` to
squeeze out some more performance::

  (rep)(xmov1)mov $data, $data

At the end is the standard go-to-next-packet sequence::

  waitin
  mov $01, $data

Reassembling Firmwares
======================

Of course, the main use of assembling is to take the firmware you're using,
modify it to test something, and reassemble it. Reassembling a firmware should
work out-of-the-box, and should give you back an identical firmware, but there
is a caveat if you want to reassemble a modified firmware and use preemption.
The preemption routines contain a few tables embedded in the firmware, and they
load the offset of the table with a ``mov`` instruction that needs to be turned
into a relocation and then add it to ``CP_SQE_INSTR_BASE``. ``afuc-asm``
supports using labels as immediates for this::

  foo:
  [00000000]
  ...

  mov $02, #foo << 2 ; #foo will be replaced with the offset in words

However, you have to manually insert the labels and replace the constant. On
a7xx there are multiple tables next to each other that look like one table, so
be careful to make sure you've found all the places it offsets from
``CP_SQE_INSTR_BASE``! There are also tables in the BV microcode on a7xx. To
check that the relocations are correct, check that reassembling an otherwise
unmodified firmware still gives an identical result after adding the
relocations.

A6XX NOTES
==========

The ``$14`` register holds global flags set by::

  CP_SKIP_IB2_ENABLE_LOCAL - b8
  CP_SKIP_IB2_ENABLE_GLOBAL - b9
  CP_SET_MARKER
     MODE=GMEM - sets b15
     MODE=BLIT2D - clears b15, b12, b7
  CP_SET_MODE - b29+b30
  CP_SET_VISIBILITY_OVERRIDE - b11, b21, b30?
  CP_SET_DRAW_STATE - checks b29+b30

  CP_COND_REG_EXEC - checks b10, which should be predicate flag?
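
When looking at a dump of ``$14``, a small helper that names the set bits can
save some time. The sketch below is hypothetical tooling, not part of the afuc
tools; it only covers the single-bit flags from the notes above that are not
marked with a question mark, and the labels are just the packet names used
there.

```python
# Bit numbers copied from the notes above; uncertain entries are left out.
FLAG_BITS = {
    8: "CP_SKIP_IB2_ENABLE_LOCAL",
    9: "CP_SKIP_IB2_ENABLE_GLOBAL",
    15: "CP_SET_MARKER MODE=GMEM",
}


def describe_flags(value):
    """Return the labels of all known flag bits set in a $14 value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if value & (1 << bit)]


flags = describe_flags((1 << 8) | (1 << 15))
```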