Freedreno
=========

Freedreno is the GLES and GL driver for Adreno 2xx-6xx GPUs.  It implements up
to OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
------

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`.  The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.
There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem").  It is UMA, using
mostly write-combined memory but with the ability to map some buffers as
cache-coherent with the CPU.

.. toctree::
   :glob:

   freedreno/hw/*

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

  Cluster
    A group of hardware registers, often with multiple copies to allow
    pipelining.  There is an M:N relationship between hardware blocks that do
    work and the clusters of registers for the state that hardware blocks use.

  CP
    Command Processor.  Reads the stream of state changes and draw commands
    generated by the driver.

  PFP
    Prefetch Parser.  Adreno 2xx-4xx CP component.

  ME
    Micro Engine.  Adreno 2xx-4xx CP component after PFP, handles most PM4
    commands.

  SQE
    a6xx+ replacement for PFP/ME.  This is the microcontroller that runs the
    microcode (loaded from Linux) which actually processes the command stream
    and writes to the hardware registers.  See `afuc
    <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

  ROQ
    DMA engine used by the SQE for reading memory, with some prefetch buffering.
    Mostly reads in the command stream, but also serves for
    ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

  SP
    Shader Processor.  Unified, scalar shader engine.  One or more, depending on
    GPU and tier.

  TP
    Texture Processor.

  UCHE
    Unified L2 Cache.  32KB on A330, unclear how big now.

  CCU
    Color Cache Unit.

  VSC
    Visibility Stream Compressor.

  PVS
    Primitive Visibility Stream.

  FE
    Front End?  Index buffer and vertex attribute fetch cluster.  Includes PC,
    VFD, VPC.

  VFD
    Vertex Fetch and Decode.

  VPC
    Varying/Position Cache?  Hardware block that stores shaded vertex data for
    primitive assembly.

  HLSQ
    High Level Sequencer.  Manages state for the SPs, batches up PS invocations
    between primitives, is involved in preemption.

  PC_VS
    Cluster where varyings are read from VPC and assembled into primitives to
    feed GRAS.

  VS
    Vertex Shader.  Responsible for generating VS/GS/tess invocations.

  GRAS
    Rasterizer.  Responsible for generating PS invocations from primitives, also
    does LRZ.

  PS
    Pixel Shader.

  RB
    Render Backend.  Performs both early and late Z testing, blending, and
    attachment stores of output of the PS.

  GMEM
    Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
    attachments during tiled rendering.

  LRZ
    Low Resolution Z.  A low resolution area of the depth buffer that can be
    initialized during the binning pass to contain the worst-case (farthest) Z
    values in a block, and then used to early reject fragments during
    rasterization.

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``).  Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode).  Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering.  Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's access is spread through the cache even if the attachments have the
same size and alignment in address space.  This means that the cache must be
flushed and invalidated between memory being used for one attachment and another
(notably depth vs color, but also MRT color).

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE.  This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will make read-only
storage image access go through it as well).  It is not coherent with UCHE (you
may get stale results when you ``sam`` after ``stib``), but it appears to be
flushed per draw or similar, because you don't need a manual invalidate between
draws storing to an image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches, and
instead uses FIFOs in the ROQ to avoid stalls reading from system memory.

Draw states
^^^^^^^^^^^

Because the SQE is not a fast processor, and tiled rendering means that many
draws won't even be used in many bins, state updates can (since a5xx) be
batched up into "draw states" that point to a fragment of CP packets.  At draw
time, if the draw call is going to actually execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and, for any with
an update since the last time they were executed, it executes the corresponding
fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering.  This allows a
single command stream to be generated which can be executed in any of the modes,
unlike pre-a6xx where we had to generate separate command lists for the binning
and rendering phases.

Note that this means that the generated draw state has to always update all of
the state you have chosen to pack into that ``GROUP_ID``, since any of your
previous state changes in a previous draw state command may have been skipped.

Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers.  In a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list).  To pipeline state updates and drawing, registers
generally have two copies ("contexts") in their cluster, so previous draws can
be working on the previous set of register state while the next draw's state is
being set up.  You can find what registers go into which clusters by looking at
:command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section.

As SQE processes register writes in the command stream, it sends them into a
per-cluster queue stored in ``CP_MEMPOOL``.  This allows the pipeline stages to
process their stream of register updates and events independent of each other
(so even with just 2 contexts in a stage, earlier stages can proceed on to later
draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes stall until the context is done.

During a 3D draw command, SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster.
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
  the actual event/drawing.
- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
  done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
  the registers that were dirtied in this context to that one.

The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``,
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also, note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

In a2xx-a4xx, there weren't per-stage clusters, and instead there were two
register banks that were flipped between per draw.

Bindless/Bindful Descriptors (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with a6xx, cat5 (texture) and cat6 (image/SSBO/UBO) instructions are
extended to support bindless descriptors.

In the old bindful model, descriptors are separate for textures, samplers,
UBOs, and IBOs (combined descriptor for images and SSBOs), with separate
registers for the memory containing the array of descriptors, and/or different
``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM``
to pre-load the descriptors into cache.

- textures - per-shader-stage
   - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT``
   - state-type: ``ST6_CONSTANTS``
   - state-block: ``SB6_xS_TEX``
- samplers - per-shader-stage
   - registers: ``SP_xS_TEX_SAMP``
   - state-type: ``ST6_SHADER``
   - state-block: ``SB6_xS_TEX``
- UBOs - per-shader-stage
   - registers: none
   - state-type: ``ST6_UBO``
   - state-block: ``SB6_xS_SHADER``
- IBOs - global across 3d shader stages, separate for compute shader
   - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT``
   - state-type: ``ST6_SHADER``
   - state-block: ``SB6_IBO`` or ``SB6_CS_IBO`` for compute shaders
   - Note, unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used,
     as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG``
     depending on shader stage.

.. note::
   For the per-shader-stage registers and state-blocks the ``xS`` notation
   refers to per-shader-stage names, e.g. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX``.

Textures and IBOs (images) use *basically* the same 64-byte descriptor format
with some exceptions (for example, for IBOs cubemaps are handled as 2d arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images.  Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.

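As a worked example of the UBO descriptor packing described above, the sketch
below splits the 8-byte descriptor into two 32-bit dwords, so that the upper 15
bits of the 64-bit value land in bits 17-31 of the high dword.  The helper name
``ubo_descriptor`` and the sample values are hypothetical, and the units of the
size field are not specified here.

.. code-block:: sh

  # Hypothetical helper: pack a UBO address and size field into two dwords.
  # Assumes the address fits in the low 49 bits of the 64-bit descriptor.
  ubo_descriptor() {
      addr=$(( $1 )); size=$(( $2 ))
      lo=$(( addr & 0xffffffff ))
      hi=$(( ((addr >> 32) & 0x1ffff) | ((size & 0x7fff) << 17) ))
      printf '0x%08x 0x%08x\n' "$lo" "$hi"
  }

  ubo_descriptor 0x123456000 4    # prints: 0x23456000 0x00080001
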
In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs compute stage).  Each HW descriptor set is an array of descriptors
of configurable size (each descriptor set can be configured for a descriptor
pitch of 8 bytes or 64 bytes).  Each descriptor can be of arbitrary format (i.e.
UBOs/IBOs/textures/samplers interleaved); its interpretation by the HW is
determined by the instruction that references the descriptor.  Each descriptor
set can contain at least 2^16 descriptors.

The HW is configured with the base address of the descriptor set via an array
of "BINDLESS_BASE" registers, i.e. ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction.  The address of the descriptor is calculated as::

   descriptor_addr = (BINDLESS_BASE[n] & ~0x3) +
                     (idx * 4 * (2 << (BINDLESS_BASE[n] & 0x3)))

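The address calculation can be checked with a small shell sketch.  It assumes
the pitch selector is the low two bits of the base register value, so that a
selector of 0 gives the 8-byte pitch and a selector of 3 gives the 64-byte
pitch; the function name and the sample register values are made up for
illustration.

.. code-block:: sh

  # Sketch of the bindless descriptor address calculation.
  # $1: BINDLESS_BASE[n] register value (pitch selector in the low two bits)
  # $2: descriptor index within the set
  descriptor_addr() {
      base=$(( $1 )); idx=$(( $2 ))
      pitch=$(( 4 * (2 << (base & 0x3)) ))    # 8 bytes for 0, 64 bytes for 3
      printf '0x%x\n' $(( (base & ~0x3) + idx * pitch ))
  }

  descriptor_addr 0x100000000 2    # 8-byte pitch:  prints 0x100000010
  descriptor_addr 0x100000003 2    # 64-byte pitch: prints 0x100000080
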
.. note::
   Turnip reserves one descriptor set for internal use and exposes the other
   four for the application via the Vulkan API.

Software Architecture
---------------------

Freedreno and Turnip use a shared core for shader compiler, image layout, and
register and command stream definitions.  They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU devcoredump
^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported by
the GPU (including its internal hang detection).  If a fault in GPU address
space happened, you should expect to find a message from the iommu, with the
faulting address and a hardware unit involved:

.. code-block:: text

  *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``.  You can cp that file to a
:file:`crash.devcore` to save it; otherwise the kernel will expire it
eventually.  Echo 1 to the file to free the core early, as another core won't be
taken until then.

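The save-then-free sequence can be wrapped in a small helper.  This is a
minimal sketch: the function name ``save_devcore`` and its second parameter
(an alternative root directory, useful for trying the helper without a real
crash) are hypothetical.

.. code-block:: sh

  # Hypothetical helper: save the first available devcoredump, then free it.
  # $1: output file (default: crash.devcore)
  # $2: devcoredump root (default: the sysfs directory)
  save_devcore() {
      out=${1:-crash.devcore}
      root=${2:-/sys/devices/virtual/devcoredump}
      for data in "$root"/*/data; do
          [ -e "$data" ] || continue
          # Copy first; writing 1 afterwards frees the core in the kernel.
          cp "$data" "$out" && echo 1 > "$data"
          return
      done
      echo "no devcoredump found under $root" >&2
      return 1
  }

Reading and writing the sysfs file normally requires root, so a typical
invocation would be ``save_devcore crash.devcore`` from a root shell.
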
Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it.  The output will have ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped.  Note that it is expected that this will be
some distance past whatever state triggered the fault, given GPU pipelining, and
will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event.  You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

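Decoded dumps can be long, so it may help to jump straight to the two places
mentioned above.  The helper below is a sketch that assumes the decoded output
was first saved to a text file, for example with
``crashdec -f crash.devcore > crash.txt``; the name ``crash_summary`` is
hypothetical.

.. code-block:: sh

  # Hypothetical helper: print the line numbers of the sections of interest
  # in a saved crashdec decode.
  # $1: text file holding the decoded output
  crash_summary() {
      grep -n -e 'ESTIMATED CRASH LOCATION' \
              -e 'regs-name: CP_MEMPOOL' "$1"
  }
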
If ``ESTIMATED CRASH LOCATION`` doesn't exist, you can look at ``CP_SQE_STAT``
instead, though this is a last resort and likely won't be helpful.

.. code-block:: text

  indexed-registers:
    - regs-name: CP_SQE_STAT
      dwords: 51
         PC: 00d7                                <-------------
        PKT: CP_LOAD_STATE6_FRAG
        $01: 70348003           $11: 00000000
        $02: 20000000           $12: 00000022

The ``PC`` value is an instruction address in the current firmware.
You would need to disassemble the firmware
(:file:`/lib/firmware/qcom/aXXX_sqe.fw`) via:

.. code-block:: sh

  afuc-disasm -v a650_sqe.fw > a650_sqe.fw.disasm

Now search for the PC value in the disassembly, e.g.:

.. code-block:: text

  l018: 00d1: 08dd0001  add $addr, $06, 0x0001
        00d2: 981ff806  mov $data, $data
        00d3: 8a080001  mov $08, 0x0001 << 16
        00d4: 3108ffff  or $08, $08, 0xffff
        00d5: 9be8f805  and $data, $data, $08
        00d6: 9806e806  mov $addr, $06
        00d7: 9803f806  mov $data, $03           <------------- HERE
        00d8: d8000000  waitin
        00d9: 981f0806  mov $01, $data

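Searching for the PC can be scripted.  The sketch below assumes the
disassembly layout shown above, where each instruction line carries its
address followed by a colon, and it relies on the ``-B``/``-A`` context options
found in GNU, BSD and BusyBox :command:`grep`.  The helper name ``find_pc`` is
hypothetical.

.. code-block:: sh

  # Hypothetical helper: show the instruction at a given PC with some context.
  # $1: PC value as printed in CP_SQE_STAT (e.g. 00d7)
  # $2: disassembly file produced by afuc-disasm
  find_pc() {
      grep -n -B 3 -A 2 -E "(^|[[:space:]])$1: " "$2"
  }

For the dump above this would be ``find_pc 00d7 a650_sqe.fw.disasm``.
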
Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel.  We have an interface for the kernel to capture all
submitted command streams:

.. code-block:: sh

  cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: sh

  echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the system
out of memory doing this, so you probably don't want to enable it while playing
a heavyweight game.  Instead, to capture a command stream within a game, you
probably want to cause a crash in the GPU during a frame of interest so that a
single GPU core dump is generated.  Emitting ``0xdeadbeef`` in the CS should be
enough to cause a fault.

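To keep a capture limited to a single run of a workload, the background
:command:`cat` can be started and stopped around it.  This is only a sketch:
the helper name ``capture_cmdstream`` and the ``RD_NODE`` override are
hypothetical, and reading the debugfs node normally requires root.

.. code-block:: sh

  # Hypothetical helper: capture command streams while a workload runs.
  # $1: output file; remaining arguments: the workload command line.
  # RD_NODE may be set to read from a different node than the default.
  capture_cmdstream() {
      out=$1; shift
      cat "${RD_NODE:-/sys/kernel/debug/dri/0/rd}" > "$out" &
      capture_pid=$!
      "$@"                                    # run the workload
      status=$?
      kill "$capture_pid" 2>/dev/null || true # stop the capture
      wait "$capture_pid" 2>/dev/null || true
      return "$status"
  }

A run would then look like ``capture_cmdstream cmdstream ./app``, where
``./app`` stands in for the workload.
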
``fd_rd_output`` facilities provide support for generating the command stream
capture from inside Mesa.  Different ``FD_RD_DUMP`` options are available:

- ``enable`` simply enables dumping the command stream on each submit for a
  given logical device.  When a more advanced option is specified, ``enable`` is
  implied.
- ``combine`` will combine all dumps into a single file instead of writing the
  dump for each submit into a standalone file.
- ``full`` will dump every buffer object, which is necessary for replays of
  command streams (see below).
- ``trigger`` will establish a trigger file through which dumps can be better
  controlled.  Writing a positive integer value into the file will enable
  dumping of that many subsequent submits.  Writing -1 will enable dumping of
  submits until disabled.  Writing 0 (or any other value) will disable dumps.

Output dump files and the trigger file (when enabled) are hard-coded to be
placed under ``/tmp``, or ``/data/local/tmp`` under Android.
``FD_RD_DUMP_TESTNAME`` can be used to specify a more descriptive prefix for the
output or trigger files.

The functionality is generic to any Freedreno-based backend, but is currently
only integrated in the MSM backend of Turnip.  Using the existing
``TU_DEBUG=rd`` option will translate to ``FD_RD_DUMP=enable``.

Capturing Hang RD
+++++++++++++++++

A devcore file contains only the hanging command stream, not all submitted
command streams.  It is also geared towards analyzing the GPU state at the
moment of the crash.

Alternatively, it's possible to obtain the whole submission with all command
streams via ``/sys/kernel/debug/dri/0/hangrd``:

.. code-block:: sh

  # Start the cat _before_ the expected hang
  sudo cat /sys/kernel/debug/dri/0/hangrd > logfile.rd

The format of hangrd is the same as in ordinary command stream capture.
``rd_full`` also has the same effect on it.

Replaying Command Stream
^^^^^^^^^^^^^^^^^^^^^^^^

The ``replay`` tool allows capturing and replaying ``rd`` to reproduce GPU
faults.  It is especially useful for transient GPU issues, since it has a much
higher chance of reproducing them.

Dumping rendering results or even just memory is currently unsupported.

- Replaying command streams requires a kernel with ``MSM_INFO_SET_IOVA``
  support.
- The ``rd`` capture must have full snapshots of the memory (``rd_full`` is
  enabled).

Replaying is done via the ``replay`` tool:

.. code-block:: sh

  ./replay test_replay.rd

More examples:

.. code-block:: sh

  ./replay --first=start_submit_n --last=last_submit_n test_replay.rd

.. code-block:: sh

  ./replay --override=0 test_replay.rd

Editing Command Stream (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While replaying a fault is useful in itself, modifying the capture to
understand what causes the fault can be even more useful.

``rddecompiler`` decompiles a single cmdstream from ``rd`` into compilable C
source.  Given the address space bounds, the generated program creates a new
``rd`` which can be used to override a cmdstream with ``replay``.  The generated
``rd`` is not replayable on its own and depends on buffers provided by the
source ``rd``.

The C source can be compiled by putting it into
:file:`src/freedreno/decode/generate-rd.cc`.

The workflow would look like this:

1. Find the number of the cmdstream you want to edit;
2. Decompile it:

.. code-block:: sh

  ./rddecompiler -s %cmd_stream_n% example.rd > src/freedreno/decode/generate-rd.cc

3. Edit the command stream;
4. Compile and deploy freedreno tools;
5. Plug the generator into cmdstream replay:

.. code-block:: sh

  ./replay --override=%cmd_stream_n%

6. Repeat 3-5.

GPU Hang Debugging
^^^^^^^^^^^^^^^^^^

This is not a guide for how to do it, but mostly an enumeration of methods.

Useful ``TU_DEBUG`` (for Turnip) options to narrow down the hang cause:

``sysmem``, ``gmem``, ``nobin``, ``forcebin``, ``noubwc``, ``nolrz``, ``flushall``, ``syncdraw``, ``rast_order``

Useful ``FD_MESA_DEBUG`` (for Freedreno) options:

``sysmem``, ``gmem``, ``nobin``, ``noubwc``, ``nolrz``, ``notile``, ``dclear``, ``ddraw``, ``flush``, ``inorder``, ``noblit``

Useful ``IR3_SHADER_DEBUG`` options:

``nouboopt``, ``spillall``, ``nopreamble``, ``nofp16``

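One way to work through such a list is to run the failing workload once per
option and note which options make the problem go away.  The loop below is a
sketch for ``TU_DEBUG``; the helper name ``sweep_tu_debug`` is hypothetical, and
it assumes the workload signals failure through its exit status.

.. code-block:: sh

  # Hypothetical helper: run a workload once per TU_DEBUG option and report
  # the options under which it exits successfully.
  # $1: space-separated option list; remaining arguments: the workload.
  sweep_tu_debug() {
      opts=$1; shift
      for opt in $opts; do
          if TU_DEBUG=$opt "$@" >/dev/null 2>&1; then
              echo "passes with TU_DEBUG=$opt"
          fi
      done
  }

For example, ``sweep_tu_debug "sysmem nolrz flushall" ./app``, with ``./app``
standing in for the workload.  Setting ``MESA_VK_ABORT_ON_DEVICE_LOSS=1`` in the
environment helps a lost device show up as a failing exit status.
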
Use Graphics Flight Recorder to narrow down the place which hangs, or use our
own breadcrumbs implementation in case of unrecoverable hangs.

In case of faults, use RenderDoc to find the problematic command.  If it's
a draw call, edit the shader in RenderDoc to find whether the culprit is a
shader.  If yes, bisect it.

If editing the shader changes the assembly too much and the issue becomes
unreproducible, try editing the assembly itself via
``IR3_SHADER_OVERRIDE_PATH``.

If the fault or hang is transient, try capturing an ``rd`` and replaying it.  If
the issue is reproduced, bisect the GPU packets until the culprit is found.

Do the same if the culprit is not a shader.

The hang recovery mechanism in the kernel is not perfect.  In case of
unrecoverable hangs, check whether the kernel is up to date and look for
unmerged patches which could improve the recovery.

GPU Breadcrumbs
+++++++++++++++

Breadcrumbs described below are available only in Turnip.

Freedreno has simpler breadcrumbs: in debug builds it writes breadcrumbs
into ``CP_SCRATCH_REG[6]`` and per-tile breadcrumbs into ``CP_SCRATCH_REG[7]``,
so that they are available in the devcoredump.  TODO: generalize Turnip's
breadcrumbs implementation.

This is a simple implementation of breadcrumbs tracking of GPU progress,
intended to be a last resort when debugging unrecoverable hangs.
For best results use Vulkan traces to have a predictable place of hang.

For ordinary hangs, a more user-friendly solution is GFR
("Graphics Flight Recorder").

Our breadcrumbs implementation aims to handle cases where nothing can be done
after the hang.  In-driver breadcrumbs also allow more precise tracking since
we can target a single GPU packet.

While breadcrumbs support gmem, try to reproduce the hang in sysmem mode
because it requires far fewer breadcrumb writes and syncs.

Breadcrumbs settings:

.. code-block:: sh

  TU_BREADCRUMBS=%IP%:%PORT%,break=%BREAKPOINT%:%BREAKPOINT_HITS%

``BREAKPOINT``
  The breadcrumb starting from which we require explicit ack.
``BREAKPOINT_HITS``
  How many times the breakpoint should be reached for the break to occur.
  Necessary for gmem mode and re-usable cmdbuffers, in both of which
  the same cmdstream could be executed several times.

A typical workflow would be:

- Start listening for breadcrumbs on a remote host:

.. code-block:: sh

   nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

- Start capturing the command stream;
- Replay the hanging trace with:

.. code-block:: sh

   TU_BREADCRUMBS=$IP:$PORT,break=-1:0

- Increase the hangcheck period:

.. code-block:: sh

   echo -n 60000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

- After the GPU hang, note the last breadcrumb and relaunch the trace with:

.. code-block:: sh

   TU_BREADCRUMBS=%IP%:%PORT%,break=%LAST_BREADCRUMB%:%HITS%

- After the breakpoint is reached, each breadcrumb requires an
  explicit ack from the user.  This way it's possible to find
  the last packet which didn't hang.

- Find the packet in the decoded cmdstream.

Debugging random failures
^^^^^^^^^^^^^^^^^^^^^^^^^

In most cases random GPU faults and rendering artifacts are caused by some kind
of undefined behavior that falls under the following categories:

- Usage of a stale reg value;
- Usage of stale memory (e.g. expecting it to be zeroed when it is not);
- Lack of proper synchronization.

Finding instances of stale reg reads
++++++++++++++++++++++++++++++++++++

Turnip has a debug option to stomp the registers with invalid values to catch
the cases where stale data is read.

.. code-block:: sh

  MESA_VK_ABORT_ON_DEVICE_LOSS=1 \
  TU_DEBUG_STALE_REGS_RANGE=0x00000c00,0x0000be01 \
  TU_DEBUG_STALE_REGS_FLAGS=cmdbuf,renderpass \
  ./app

.. envvar:: TU_DEBUG_STALE_REGS_RANGE

  The reg range in which registers will be stomped.  Add ``inverse`` to the
  flags in order for this range to specify which registers NOT to stomp.

.. envvar:: TU_DEBUG_STALE_REGS_FLAGS

  ``cmdbuf``
    stomp registers at the start of each command buffer.
  ``renderpass``
    stomp registers before each render pass.
  ``inverse``
    changes the meaning of ``TU_DEBUG_STALE_REGS_RANGE`` to
    "regs that should NOT be stomped".

The best way to pinpoint the reg which causes a failure is to bisect the reg
range.  When a failure is caused by a combination of several registers,
the ``inverse`` flag may be set to find the reg which prevents the failure.
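
One bisection step on the reg range can be computed with a small helper.  This
is a sketch: the name ``bisect_range`` is hypothetical, and it assumes both
ends of the range are inclusive, which is not stated above.

.. code-block:: sh

  # Hypothetical helper: split a register range into two halves.
  # Prints two values suitable for TU_DEBUG_STALE_REGS_RANGE.
  bisect_range() {
      lo=$(( $1 )); hi=$(( $2 ))
      mid=$(( (lo + hi) / 2 ))
      printf '0x%08x,0x%08x\n' "$lo" "$mid"
      printf '0x%08x,0x%08x\n' $(( mid + 1 )) "$hi"
  }

  bisect_range 0x00000c00 0x0000be01
  # prints:
  #   0x00000c00,0x00006500
  #   0x00006501,0x0000be01

Re-run the workload with each half in turn and keep splitting whichever half
still reproduces the failure.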
639