xref: /aosp_15_r20/external/llvm/docs/NVPTXUsage.rst (revision 9880d6810fe72a1726cb53787c6711e909410d58)
1*9880d681SAndroid Build Coastguard Worker=============================
2*9880d681SAndroid Build Coastguard WorkerUser Guide for NVPTX Back-end
3*9880d681SAndroid Build Coastguard Worker=============================
4*9880d681SAndroid Build Coastguard Worker
5*9880d681SAndroid Build Coastguard Worker.. contents::
6*9880d681SAndroid Build Coastguard Worker   :local:
7*9880d681SAndroid Build Coastguard Worker   :depth: 3
8*9880d681SAndroid Build Coastguard Worker
9*9880d681SAndroid Build Coastguard Worker
10*9880d681SAndroid Build Coastguard WorkerIntroduction
11*9880d681SAndroid Build Coastguard Worker============
12*9880d681SAndroid Build Coastguard Worker
13*9880d681SAndroid Build Coastguard WorkerTo support GPU programming, the NVPTX back-end supports a subset of LLVM IR
14*9880d681SAndroid Build Coastguard Workeralong with a defined set of conventions used to represent GPU programming
15*9880d681SAndroid Build Coastguard Workerconcepts. This document provides an overview of the general usage of the back-
16*9880d681SAndroid Build Coastguard Workerend, including a description of the conventions used and the set of accepted
17*9880d681SAndroid Build Coastguard WorkerLLVM IR.
18*9880d681SAndroid Build Coastguard Worker
19*9880d681SAndroid Build Coastguard Worker.. note::
20*9880d681SAndroid Build Coastguard Worker
21*9880d681SAndroid Build Coastguard Worker   This document assumes a basic familiarity with CUDA and the PTX
22*9880d681SAndroid Build Coastguard Worker   assembly language. Information about the CUDA Driver API and the PTX assembly
23*9880d681SAndroid Build Coastguard Worker   language can be found in the `CUDA documentation
24*9880d681SAndroid Build Coastguard Worker   <http://docs.nvidia.com/cuda/index.html>`_.
25*9880d681SAndroid Build Coastguard Worker
26*9880d681SAndroid Build Coastguard Worker
27*9880d681SAndroid Build Coastguard Worker
28*9880d681SAndroid Build Coastguard WorkerConventions
29*9880d681SAndroid Build Coastguard Worker===========
30*9880d681SAndroid Build Coastguard Worker
31*9880d681SAndroid Build Coastguard WorkerMarking Functions as Kernels
32*9880d681SAndroid Build Coastguard Worker----------------------------
33*9880d681SAndroid Build Coastguard Worker
34*9880d681SAndroid Build Coastguard WorkerIn PTX, there are two types of functions: *device functions*, which are only
35*9880d681SAndroid Build Coastguard Workercallable by device code, and *kernel functions*, which are callable by host
36*9880d681SAndroid Build Coastguard Workercode. By default, the back-end will emit device functions. Metadata is used to
37*9880d681SAndroid Build Coastguard Workerdeclare a function as a kernel function. This metadata is attached to the
38*9880d681SAndroid Build Coastguard Worker``nvvm.annotations`` named metadata object, and has the following format:
39*9880d681SAndroid Build Coastguard Worker
40*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
41*9880d681SAndroid Build Coastguard Worker
42*9880d681SAndroid Build Coastguard Worker   !0 = !{<function-ref>, metadata !"kernel", i32 1}
43*9880d681SAndroid Build Coastguard Worker
44*9880d681SAndroid Build Coastguard WorkerThe first parameter is a reference to the kernel function. The following
45*9880d681SAndroid Build Coastguard Workerexample shows a kernel function calling a device function in LLVM IR. The
46*9880d681SAndroid Build Coastguard Workerfunction ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
47*9880d681SAndroid Build Coastguard Worker
48*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
49*9880d681SAndroid Build Coastguard Worker
50*9880d681SAndroid Build Coastguard Worker    define float @my_fmad(float %x, float %y, float %z) {
51*9880d681SAndroid Build Coastguard Worker      %mul = fmul float %x, %y
52*9880d681SAndroid Build Coastguard Worker      %add = fadd float %mul, %z
53*9880d681SAndroid Build Coastguard Worker      ret float %add
54*9880d681SAndroid Build Coastguard Worker    }
55*9880d681SAndroid Build Coastguard Worker
56*9880d681SAndroid Build Coastguard Worker    define void @my_kernel(float* %ptr) {
57*9880d681SAndroid Build Coastguard Worker      %val = load float, float* %ptr
58*9880d681SAndroid Build Coastguard Worker      %ret = call float @my_fmad(float %val, float %val, float %val)
59*9880d681SAndroid Build Coastguard Worker      store float %ret, float* %ptr
60*9880d681SAndroid Build Coastguard Worker      ret void
61*9880d681SAndroid Build Coastguard Worker    }
62*9880d681SAndroid Build Coastguard Worker
63*9880d681SAndroid Build Coastguard Worker    !nvvm.annotations = !{!1}
64*9880d681SAndroid Build Coastguard Worker    !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}
65*9880d681SAndroid Build Coastguard Worker
66*9880d681SAndroid Build Coastguard WorkerWhen compiled, the PTX kernel functions are callable by host-side code.
67*9880d681SAndroid Build Coastguard Worker
68*9880d681SAndroid Build Coastguard Worker
69*9880d681SAndroid Build Coastguard Worker.. _address_spaces:
70*9880d681SAndroid Build Coastguard Worker
71*9880d681SAndroid Build Coastguard WorkerAddress Spaces
72*9880d681SAndroid Build Coastguard Worker--------------
73*9880d681SAndroid Build Coastguard Worker
74*9880d681SAndroid Build Coastguard WorkerThe NVPTX back-end uses the following address space mapping:
75*9880d681SAndroid Build Coastguard Worker
76*9880d681SAndroid Build Coastguard Worker   ============= ======================
77*9880d681SAndroid Build Coastguard Worker   Address Space Memory Space
78*9880d681SAndroid Build Coastguard Worker   ============= ======================
79*9880d681SAndroid Build Coastguard Worker   0             Generic
80*9880d681SAndroid Build Coastguard Worker   1             Global
81*9880d681SAndroid Build Coastguard Worker   2             Internal Use
82*9880d681SAndroid Build Coastguard Worker   3             Shared
83*9880d681SAndroid Build Coastguard Worker   4             Constant
84*9880d681SAndroid Build Coastguard Worker   5             Local
85*9880d681SAndroid Build Coastguard Worker   ============= ======================
86*9880d681SAndroid Build Coastguard Worker
87*9880d681SAndroid Build Coastguard WorkerEvery global variable and pointer type is assigned to one of these address
88*9880d681SAndroid Build Coastguard Workerspaces, with 0 being the default address space. Intrinsics are provided which
89*9880d681SAndroid Build Coastguard Workercan be used to convert pointers between the generic and non-generic address
90*9880d681SAndroid Build Coastguard Workerspaces.
91*9880d681SAndroid Build Coastguard Worker
92*9880d681SAndroid Build Coastguard WorkerAs an example, the following IR will define an array ``@g`` that resides in
93*9880d681SAndroid Build Coastguard Workerglobal device memory.
94*9880d681SAndroid Build Coastguard Worker
95*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
96*9880d681SAndroid Build Coastguard Worker
97*9880d681SAndroid Build Coastguard Worker    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
98*9880d681SAndroid Build Coastguard Worker
99*9880d681SAndroid Build Coastguard WorkerLLVM IR functions can read and write to this array, and host-side code can
100*9880d681SAndroid Build Coastguard Workercopy data to it by name with the CUDA Driver API.
101*9880d681SAndroid Build Coastguard Worker
102*9880d681SAndroid Build Coastguard WorkerNote that since address space 0 is the generic space, it is illegal to have
103*9880d681SAndroid Build Coastguard Workerglobal variables in address space 0.  Address space 0 is the default address
104*9880d681SAndroid Build Coastguard Workerspace in LLVM, so the ``addrspace(N)`` annotation is *required* for global
105*9880d681SAndroid Build Coastguard Workervariables.
106*9880d681SAndroid Build Coastguard Worker
107*9880d681SAndroid Build Coastguard Worker
108*9880d681SAndroid Build Coastguard WorkerTriples
109*9880d681SAndroid Build Coastguard Worker-------
110*9880d681SAndroid Build Coastguard Worker
111*9880d681SAndroid Build Coastguard WorkerThe NVPTX target uses the module triple to select between 32/64-bit code
112*9880d681SAndroid Build Coastguard Workergeneration and the driver-compiler interface to use. The triple architecture
113*9880d681SAndroid Build Coastguard Workercan be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
114*9880d681SAndroid Build Coastguard Workeroperating system should be one of ``cuda`` or ``nvcl``, which determines the
115*9880d681SAndroid Build Coastguard Workerinterface used by the generated code to communicate with the driver.  Most
116*9880d681SAndroid Build Coastguard Workerusers will want to use ``cuda`` as the operating system, which makes the
117*9880d681SAndroid Build Coastguard Workergenerated PTX compatible with the CUDA Driver API.
118*9880d681SAndroid Build Coastguard Worker
119*9880d681SAndroid Build Coastguard WorkerExample: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
120*9880d681SAndroid Build Coastguard Worker
121*9880d681SAndroid Build Coastguard WorkerExample: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
122*9880d681SAndroid Build Coastguard Worker
123*9880d681SAndroid Build Coastguard Worker
124*9880d681SAndroid Build Coastguard Worker
125*9880d681SAndroid Build Coastguard Worker.. _nvptx_intrinsics:
126*9880d681SAndroid Build Coastguard Worker
127*9880d681SAndroid Build Coastguard WorkerNVPTX Intrinsics
128*9880d681SAndroid Build Coastguard Worker================
129*9880d681SAndroid Build Coastguard Worker
130*9880d681SAndroid Build Coastguard WorkerAddress Space Conversion
131*9880d681SAndroid Build Coastguard Worker------------------------
132*9880d681SAndroid Build Coastguard Worker
133*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
134*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135*9880d681SAndroid Build Coastguard Worker
136*9880d681SAndroid Build Coastguard WorkerSyntax:
137*9880d681SAndroid Build Coastguard Worker"""""""
138*9880d681SAndroid Build Coastguard Worker
139*9880d681SAndroid Build Coastguard WorkerThese are overloaded intrinsics.  You can use these on any pointer types.
140*9880d681SAndroid Build Coastguard Worker
141*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
142*9880d681SAndroid Build Coastguard Worker
143*9880d681SAndroid Build Coastguard Worker    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
144*9880d681SAndroid Build Coastguard Worker    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
145*9880d681SAndroid Build Coastguard Worker    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
146*9880d681SAndroid Build Coastguard Worker    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
147*9880d681SAndroid Build Coastguard Worker
148*9880d681SAndroid Build Coastguard WorkerOverview:
149*9880d681SAndroid Build Coastguard Worker"""""""""
150*9880d681SAndroid Build Coastguard Worker
151*9880d681SAndroid Build Coastguard WorkerThe '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
152*9880d681SAndroid Build Coastguard Workeraddress space to a generic address space pointer.
153*9880d681SAndroid Build Coastguard Worker
154*9880d681SAndroid Build Coastguard WorkerSemantics:
155*9880d681SAndroid Build Coastguard Worker""""""""""
156*9880d681SAndroid Build Coastguard Worker
157*9880d681SAndroid Build Coastguard WorkerThese intrinsics modify the pointer value to be a valid generic address space
158*9880d681SAndroid Build Coastguard Workerpointer.
159*9880d681SAndroid Build Coastguard Worker
160*9880d681SAndroid Build Coastguard Worker
161*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
162*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
163*9880d681SAndroid Build Coastguard Worker
164*9880d681SAndroid Build Coastguard WorkerSyntax:
165*9880d681SAndroid Build Coastguard Worker"""""""
166*9880d681SAndroid Build Coastguard Worker
167*9880d681SAndroid Build Coastguard WorkerThese are overloaded intrinsics.  You can use these on any pointer types.
168*9880d681SAndroid Build Coastguard Worker
169*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
170*9880d681SAndroid Build Coastguard Worker
171*9880d681SAndroid Build Coastguard Worker    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
172*9880d681SAndroid Build Coastguard Worker    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
173*9880d681SAndroid Build Coastguard Worker    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
174*9880d681SAndroid Build Coastguard Worker    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
175*9880d681SAndroid Build Coastguard Worker
176*9880d681SAndroid Build Coastguard WorkerOverview:
177*9880d681SAndroid Build Coastguard Worker"""""""""
178*9880d681SAndroid Build Coastguard Worker
179*9880d681SAndroid Build Coastguard WorkerThe '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
180*9880d681SAndroid Build Coastguard Workeraddress space to a pointer in the target address space.  Note that these
181*9880d681SAndroid Build Coastguard Workerintrinsics are only useful if the address space of the target address space of
182*9880d681SAndroid Build Coastguard Workerthe pointer is known.  It is not legal to use address space conversion
183*9880d681SAndroid Build Coastguard Workerintrinsics to convert a pointer from one non-generic address space to another
184*9880d681SAndroid Build Coastguard Workernon-generic address space.
185*9880d681SAndroid Build Coastguard Worker
186*9880d681SAndroid Build Coastguard WorkerSemantics:
187*9880d681SAndroid Build Coastguard Worker""""""""""
188*9880d681SAndroid Build Coastguard Worker
189*9880d681SAndroid Build Coastguard WorkerThese intrinsics modify the pointer value to be a valid pointer in the target
190*9880d681SAndroid Build Coastguard Workernon-generic address space.
191*9880d681SAndroid Build Coastguard Worker
192*9880d681SAndroid Build Coastguard Worker
193*9880d681SAndroid Build Coastguard WorkerReading PTX Special Registers
194*9880d681SAndroid Build Coastguard Worker-----------------------------
195*9880d681SAndroid Build Coastguard Worker
196*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.read.ptx.sreg.*``'
197*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
198*9880d681SAndroid Build Coastguard Worker
199*9880d681SAndroid Build Coastguard WorkerSyntax:
200*9880d681SAndroid Build Coastguard Worker"""""""
201*9880d681SAndroid Build Coastguard Worker
202*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
203*9880d681SAndroid Build Coastguard Worker
204*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
205*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
206*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
207*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
208*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
209*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
210*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
211*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
212*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
213*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
214*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
215*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
216*9880d681SAndroid Build Coastguard Worker    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
217*9880d681SAndroid Build Coastguard Worker
218*9880d681SAndroid Build Coastguard WorkerOverview:
219*9880d681SAndroid Build Coastguard Worker"""""""""
220*9880d681SAndroid Build Coastguard Worker
221*9880d681SAndroid Build Coastguard WorkerThe '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
222*9880d681SAndroid Build Coastguard Workerspecial registers, in particular the kernel launch bounds.  These registers
223*9880d681SAndroid Build Coastguard Workermap in the following way to CUDA builtins:
224*9880d681SAndroid Build Coastguard Worker
225*9880d681SAndroid Build Coastguard Worker   ============ =====================================
226*9880d681SAndroid Build Coastguard Worker   CUDA Builtin PTX Special Register Intrinsic
227*9880d681SAndroid Build Coastguard Worker   ============ =====================================
228*9880d681SAndroid Build Coastguard Worker   ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
229*9880d681SAndroid Build Coastguard Worker   ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
230*9880d681SAndroid Build Coastguard Worker   ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
231*9880d681SAndroid Build Coastguard Worker   ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
232*9880d681SAndroid Build Coastguard Worker   ============ =====================================
233*9880d681SAndroid Build Coastguard Worker
234*9880d681SAndroid Build Coastguard Worker
235*9880d681SAndroid Build Coastguard WorkerBarriers
236*9880d681SAndroid Build Coastguard Worker--------
237*9880d681SAndroid Build Coastguard Worker
238*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.barrier0``'
239*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^
240*9880d681SAndroid Build Coastguard Worker
241*9880d681SAndroid Build Coastguard WorkerSyntax:
242*9880d681SAndroid Build Coastguard Worker"""""""
243*9880d681SAndroid Build Coastguard Worker
244*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
245*9880d681SAndroid Build Coastguard Worker
246*9880d681SAndroid Build Coastguard Worker  declare void @llvm.nvvm.barrier0()
247*9880d681SAndroid Build Coastguard Worker
248*9880d681SAndroid Build Coastguard WorkerOverview:
249*9880d681SAndroid Build Coastguard Worker"""""""""
250*9880d681SAndroid Build Coastguard Worker
251*9880d681SAndroid Build Coastguard WorkerThe '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
252*9880d681SAndroid Build Coastguard Workerinstruction, equivalent to the ``__syncthreads()`` call in CUDA.
253*9880d681SAndroid Build Coastguard Worker
254*9880d681SAndroid Build Coastguard Worker
255*9880d681SAndroid Build Coastguard WorkerOther Intrinsics
256*9880d681SAndroid Build Coastguard Worker----------------
257*9880d681SAndroid Build Coastguard Worker
258*9880d681SAndroid Build Coastguard WorkerFor the full set of NVPTX intrinsics, please see the
259*9880d681SAndroid Build Coastguard Worker``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
260*9880d681SAndroid Build Coastguard Worker
261*9880d681SAndroid Build Coastguard Worker
262*9880d681SAndroid Build Coastguard Worker.. _libdevice:
263*9880d681SAndroid Build Coastguard Worker
264*9880d681SAndroid Build Coastguard WorkerLinking with Libdevice
265*9880d681SAndroid Build Coastguard Worker======================
266*9880d681SAndroid Build Coastguard Worker
267*9880d681SAndroid Build Coastguard WorkerThe CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
268*9880d681SAndroid Build Coastguard Workerimplements many common mathematical functions. This library can be used as a
269*9880d681SAndroid Build Coastguard Workerhigh-performance math library for any compilers using the LLVM NVPTX target.
270*9880d681SAndroid Build Coastguard WorkerThe library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
271*9880d681SAndroid Build Coastguard Workerthere is a separate version for each compute architecture.
272*9880d681SAndroid Build Coastguard Worker
273*9880d681SAndroid Build Coastguard WorkerFor a list of all math functions implemented in libdevice, see
274*9880d681SAndroid Build Coastguard Worker`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
275*9880d681SAndroid Build Coastguard Worker
276*9880d681SAndroid Build Coastguard WorkerTo accommodate various math-related compiler flags that can affect code
277*9880d681SAndroid Build Coastguard Workergeneration of libdevice code, the library code depends on a special LLVM IR
278*9880d681SAndroid Build Coastguard Workerpass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
279*9880d681SAndroid Build Coastguard Workerpass looks for calls to the ``@__nvvm_reflect`` function and replaces them
280*9880d681SAndroid Build Coastguard Workerwith constants based on the defined reflection parameters. Such conditional
281*9880d681SAndroid Build Coastguard Workercode often follows a pattern:
282*9880d681SAndroid Build Coastguard Worker
283*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
284*9880d681SAndroid Build Coastguard Worker
285*9880d681SAndroid Build Coastguard Worker  float my_function(float a) {
286*9880d681SAndroid Build Coastguard Worker    if (__nvvm_reflect("FASTMATH"))
287*9880d681SAndroid Build Coastguard Worker      return my_function_fast(a);
288*9880d681SAndroid Build Coastguard Worker    else
289*9880d681SAndroid Build Coastguard Worker      return my_function_precise(a);
290*9880d681SAndroid Build Coastguard Worker  }
291*9880d681SAndroid Build Coastguard Worker
292*9880d681SAndroid Build Coastguard WorkerThe default value for all unspecified reflection parameters is zero.
293*9880d681SAndroid Build Coastguard Worker
294*9880d681SAndroid Build Coastguard WorkerThe ``NVVMReflect`` pass should be executed early in the optimization
295*9880d681SAndroid Build Coastguard Workerpipeline, immediately after the link stage. The ``internalize`` pass is also
296*9880d681SAndroid Build Coastguard Workerrecommended to remove unused math functions from the resulting PTX. For an
297*9880d681SAndroid Build Coastguard Workerinput IR module ``module.bc``, the following compilation flow is recommended:
298*9880d681SAndroid Build Coastguard Worker
299*9880d681SAndroid Build Coastguard Worker1. Save list of external functions in ``module.bc``
300*9880d681SAndroid Build Coastguard Worker2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
301*9880d681SAndroid Build Coastguard Worker3. Internalize all functions not in list from (1)
302*9880d681SAndroid Build Coastguard Worker4. Eliminate all unused internal functions
303*9880d681SAndroid Build Coastguard Worker5. Run ``NVVMReflect`` pass
304*9880d681SAndroid Build Coastguard Worker6. Run standard optimization pipeline
305*9880d681SAndroid Build Coastguard Worker
306*9880d681SAndroid Build Coastguard Worker.. note::
307*9880d681SAndroid Build Coastguard Worker
308*9880d681SAndroid Build Coastguard Worker  ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
309*9880d681SAndroid Build Coastguard Worker  libdevice functions. It is possible to link two IR modules that have been
310*9880d681SAndroid Build Coastguard Worker  linked against libdevice using different reflection variables.
311*9880d681SAndroid Build Coastguard Worker
312*9880d681SAndroid Build Coastguard WorkerSince the ``NVVMReflect`` pass replaces conditionals with constants, it will
313*9880d681SAndroid Build Coastguard Workeroften leave behind dead code of the form:
314*9880d681SAndroid Build Coastguard Worker
315*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
316*9880d681SAndroid Build Coastguard Worker
317*9880d681SAndroid Build Coastguard Worker  entry:
318*9880d681SAndroid Build Coastguard Worker    ..
319*9880d681SAndroid Build Coastguard Worker    br i1 true, label %foo, label %bar
320*9880d681SAndroid Build Coastguard Worker  foo:
321*9880d681SAndroid Build Coastguard Worker    ..
322*9880d681SAndroid Build Coastguard Worker  bar:
323*9880d681SAndroid Build Coastguard Worker    ; Dead code
324*9880d681SAndroid Build Coastguard Worker    ..
325*9880d681SAndroid Build Coastguard Worker
326*9880d681SAndroid Build Coastguard WorkerTherefore, it is recommended that ``NVVMReflect`` is executed early in the
327*9880d681SAndroid Build Coastguard Workeroptimization pipeline before dead-code elimination.
328*9880d681SAndroid Build Coastguard Worker
329*9880d681SAndroid Build Coastguard Worker
330*9880d681SAndroid Build Coastguard WorkerReflection Parameters
331*9880d681SAndroid Build Coastguard Worker---------------------
332*9880d681SAndroid Build Coastguard Worker
333*9880d681SAndroid Build Coastguard WorkerThe libdevice library currently uses the following reflection parameters to
334*9880d681SAndroid Build Coastguard Workercontrol code generation:
335*9880d681SAndroid Build Coastguard Worker
336*9880d681SAndroid Build Coastguard Worker==================== ======================================================
337*9880d681SAndroid Build Coastguard WorkerFlag                 Description
338*9880d681SAndroid Build Coastguard Worker==================== ======================================================
339*9880d681SAndroid Build Coastguard Worker``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
340*9880d681SAndroid Build Coastguard Worker==================== ======================================================
341*9880d681SAndroid Build Coastguard Worker
342*9880d681SAndroid Build Coastguard Worker
343*9880d681SAndroid Build Coastguard WorkerInvoking NVVMReflect
344*9880d681SAndroid Build Coastguard Worker--------------------
345*9880d681SAndroid Build Coastguard Worker
346*9880d681SAndroid Build Coastguard WorkerTo ensure that all dead code caused by the reflection pass is eliminated, it
347*9880d681SAndroid Build Coastguard Workeris recommended that the reflection pass is executed early in the LLVM IR
348*9880d681SAndroid Build Coastguard Workeroptimization pipeline. The pass takes an optional mapping of reflection
349*9880d681SAndroid Build Coastguard Workerparameter name to an integer value. This mapping can be specified as either a
350*9880d681SAndroid Build Coastguard Workercommand-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
351*9880d681SAndroid Build Coastguard Workerprogrammatically creating a pass pipeline.
352*9880d681SAndroid Build Coastguard Worker
353*9880d681SAndroid Build Coastguard WorkerWith ``opt``:
354*9880d681SAndroid Build Coastguard Worker
355*9880d681SAndroid Build Coastguard Worker.. code-block:: text
356*9880d681SAndroid Build Coastguard Worker
357*9880d681SAndroid Build Coastguard Worker  # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc
358*9880d681SAndroid Build Coastguard Worker
359*9880d681SAndroid Build Coastguard Worker
360*9880d681SAndroid Build Coastguard WorkerWith programmatic pass pipeline:
361*9880d681SAndroid Build Coastguard Worker
362*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
363*9880d681SAndroid Build Coastguard Worker
364*9880d681SAndroid Build Coastguard Worker  extern FunctionPass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);
365*9880d681SAndroid Build Coastguard Worker
366*9880d681SAndroid Build Coastguard Worker  StringMap<int> ReflectParams;
367*9880d681SAndroid Build Coastguard Worker  ReflectParams["__CUDA_FTZ"] = 1;
368*9880d681SAndroid Build Coastguard Worker  Passes.add(createNVVMReflectPass(ReflectParams));
369*9880d681SAndroid Build Coastguard Worker
370*9880d681SAndroid Build Coastguard Worker
371*9880d681SAndroid Build Coastguard Worker
372*9880d681SAndroid Build Coastguard WorkerExecuting PTX
373*9880d681SAndroid Build Coastguard Worker=============
374*9880d681SAndroid Build Coastguard Worker
375*9880d681SAndroid Build Coastguard WorkerThe most common way to execute PTX assembly on a GPU device is to use the CUDA
376*9880d681SAndroid Build Coastguard WorkerDriver API. This API is a low-level interface to the GPU driver and allows for
377*9880d681SAndroid Build Coastguard WorkerJIT compilation of PTX code to native GPU machine code.
378*9880d681SAndroid Build Coastguard Worker
379*9880d681SAndroid Build Coastguard WorkerInitializing the Driver API:
380*9880d681SAndroid Build Coastguard Worker
381*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
382*9880d681SAndroid Build Coastguard Worker
383*9880d681SAndroid Build Coastguard Worker    CUdevice device;
384*9880d681SAndroid Build Coastguard Worker    CUcontext context;
385*9880d681SAndroid Build Coastguard Worker
386*9880d681SAndroid Build Coastguard Worker    // Initialize the driver API
387*9880d681SAndroid Build Coastguard Worker    cuInit(0);
388*9880d681SAndroid Build Coastguard Worker    // Get a handle to the first compute device
389*9880d681SAndroid Build Coastguard Worker    cuDeviceGet(&device, 0);
390*9880d681SAndroid Build Coastguard Worker    // Create a compute device context
391*9880d681SAndroid Build Coastguard Worker    cuCtxCreate(&context, 0, device);
392*9880d681SAndroid Build Coastguard Worker
393*9880d681SAndroid Build Coastguard WorkerJIT compiling a PTX string to a device binary:
394*9880d681SAndroid Build Coastguard Worker
395*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
396*9880d681SAndroid Build Coastguard Worker
397*9880d681SAndroid Build Coastguard Worker    CUmodule module;
398*9880d681SAndroid Build Coastguard Worker    CUfunction function;
399*9880d681SAndroid Build Coastguard Worker
400*9880d681SAndroid Build Coastguard Worker    // JIT compile a null-terminated PTX string
401*9880d681SAndroid Build Coastguard Worker    cuModuleLoadData(&module, (void*)PTXString);
402*9880d681SAndroid Build Coastguard Worker
403*9880d681SAndroid Build Coastguard Worker    // Get a handle to the "myfunction" kernel function
404*9880d681SAndroid Build Coastguard Worker    cuModuleGetFunction(&function, module, "myfunction");
405*9880d681SAndroid Build Coastguard Worker
406*9880d681SAndroid Build Coastguard WorkerFor full examples of executing PTX assembly, please see the `CUDA Samples
407*9880d681SAndroid Build Coastguard Worker<https://developer.nvidia.com/cuda-downloads>`_ distribution.
408*9880d681SAndroid Build Coastguard Worker
409*9880d681SAndroid Build Coastguard Worker
410*9880d681SAndroid Build Coastguard WorkerCommon Issues
411*9880d681SAndroid Build Coastguard Worker=============
412*9880d681SAndroid Build Coastguard Worker
413*9880d681SAndroid Build Coastguard Workerptxas complains of undefined function: __nvvm_reflect
414*9880d681SAndroid Build Coastguard Worker-----------------------------------------------------
415*9880d681SAndroid Build Coastguard Worker
416*9880d681SAndroid Build Coastguard WorkerWhen linking with libdevice, the ``NVVMReflect`` pass must be used. See
417*9880d681SAndroid Build Coastguard Worker:ref:`libdevice` for more information.
418*9880d681SAndroid Build Coastguard Worker
419*9880d681SAndroid Build Coastguard Worker
420*9880d681SAndroid Build Coastguard WorkerTutorial: A Simple Compute Kernel
421*9880d681SAndroid Build Coastguard Worker=================================
422*9880d681SAndroid Build Coastguard Worker
423*9880d681SAndroid Build Coastguard WorkerTo start, let us take a look at a simple compute kernel written directly in
424*9880d681SAndroid Build Coastguard WorkerLLVM IR. The kernel implements vector addition, where each thread computes one
425*9880d681SAndroid Build Coastguard Workerelement of the output vector C from the input vectors A and B.  To make this
426*9880d681SAndroid Build Coastguard Workereasier, we also assume that only a single CTA (thread block) will be launched,
427*9880d681SAndroid Build Coastguard Workerand that it will be one dimensional.
428*9880d681SAndroid Build Coastguard Worker
429*9880d681SAndroid Build Coastguard Worker
430*9880d681SAndroid Build Coastguard WorkerThe Kernel
431*9880d681SAndroid Build Coastguard Worker----------
432*9880d681SAndroid Build Coastguard Worker
433*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
434*9880d681SAndroid Build Coastguard Worker
435*9880d681SAndroid Build Coastguard Worker  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
436*9880d681SAndroid Build Coastguard Worker  target triple = "nvptx64-nvidia-cuda"
437*9880d681SAndroid Build Coastguard Worker
438*9880d681SAndroid Build Coastguard Worker  ; Intrinsic to read X component of thread ID
439*9880d681SAndroid Build Coastguard Worker  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
440*9880d681SAndroid Build Coastguard Worker
441*9880d681SAndroid Build Coastguard Worker  define void @kernel(float addrspace(1)* %A,
442*9880d681SAndroid Build Coastguard Worker                      float addrspace(1)* %B,
443*9880d681SAndroid Build Coastguard Worker                      float addrspace(1)* %C) {
444*9880d681SAndroid Build Coastguard Worker  entry:
445*9880d681SAndroid Build Coastguard Worker    ; What is my ID?
446*9880d681SAndroid Build Coastguard Worker    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
447*9880d681SAndroid Build Coastguard Worker
448*9880d681SAndroid Build Coastguard Worker    ; Compute pointers into A, B, and C
449*9880d681SAndroid Build Coastguard Worker    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
450*9880d681SAndroid Build Coastguard Worker    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
451*9880d681SAndroid Build Coastguard Worker    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
452*9880d681SAndroid Build Coastguard Worker
453*9880d681SAndroid Build Coastguard Worker    ; Read A, B
454*9880d681SAndroid Build Coastguard Worker    %valA = load float, float addrspace(1)* %ptrA, align 4
455*9880d681SAndroid Build Coastguard Worker    %valB = load float, float addrspace(1)* %ptrB, align 4
456*9880d681SAndroid Build Coastguard Worker
457*9880d681SAndroid Build Coastguard Worker    ; Compute C = A + B
458*9880d681SAndroid Build Coastguard Worker    %valC = fadd float %valA, %valB
459*9880d681SAndroid Build Coastguard Worker
460*9880d681SAndroid Build Coastguard Worker    ; Store back to C
461*9880d681SAndroid Build Coastguard Worker    store float %valC, float addrspace(1)* %ptrC, align 4
462*9880d681SAndroid Build Coastguard Worker
463*9880d681SAndroid Build Coastguard Worker    ret void
464*9880d681SAndroid Build Coastguard Worker  }
465*9880d681SAndroid Build Coastguard Worker
466*9880d681SAndroid Build Coastguard Worker  !nvvm.annotations = !{!0}
467*9880d681SAndroid Build Coastguard Worker  !0 = !{void (float addrspace(1)*,
468*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*,
469*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*)* @kernel, !"kernel", i32 1}
470*9880d681SAndroid Build Coastguard Worker
471*9880d681SAndroid Build Coastguard Worker
472*9880d681SAndroid Build Coastguard WorkerWe can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
473*9880d681SAndroid Build Coastguard Worker
474*9880d681SAndroid Build Coastguard Worker.. code-block:: text
475*9880d681SAndroid Build Coastguard Worker
476*9880d681SAndroid Build Coastguard Worker  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
477*9880d681SAndroid Build Coastguard Worker
478*9880d681SAndroid Build Coastguard Worker
479*9880d681SAndroid Build Coastguard Worker.. note::
480*9880d681SAndroid Build Coastguard Worker
481*9880d681SAndroid Build Coastguard Worker  If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
482*9880d681SAndroid Build Coastguard Worker  in the module data layout string and use ``nvptx-nvidia-cuda`` as the
483*9880d681SAndroid Build Coastguard Worker  target triple.
484*9880d681SAndroid Build Coastguard Worker
485*9880d681SAndroid Build Coastguard Worker
486*9880d681SAndroid Build Coastguard WorkerThe output we get from ``llc`` (as of LLVM 3.4):
487*9880d681SAndroid Build Coastguard Worker
488*9880d681SAndroid Build Coastguard Worker.. code-block:: text
489*9880d681SAndroid Build Coastguard Worker
490*9880d681SAndroid Build Coastguard Worker  //
491*9880d681SAndroid Build Coastguard Worker  // Generated by LLVM NVPTX Back-End
492*9880d681SAndroid Build Coastguard Worker  //
493*9880d681SAndroid Build Coastguard Worker
494*9880d681SAndroid Build Coastguard Worker  .version 3.1
495*9880d681SAndroid Build Coastguard Worker  .target sm_20
496*9880d681SAndroid Build Coastguard Worker  .address_size 64
497*9880d681SAndroid Build Coastguard Worker
498*9880d681SAndroid Build Coastguard Worker    // .globl kernel
499*9880d681SAndroid Build Coastguard Worker                                          // @kernel
500*9880d681SAndroid Build Coastguard Worker  .visible .entry kernel(
501*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_0,
502*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_1,
503*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_2
504*9880d681SAndroid Build Coastguard Worker  )
505*9880d681SAndroid Build Coastguard Worker  {
506*9880d681SAndroid Build Coastguard Worker    .reg .f32   %f<4>;
507*9880d681SAndroid Build Coastguard Worker    .reg .s32   %r<2>;
508*9880d681SAndroid Build Coastguard Worker    .reg .s64   %rl<8>;
509*9880d681SAndroid Build Coastguard Worker
510*9880d681SAndroid Build Coastguard Worker  // BB#0:                                // %entry
511*9880d681SAndroid Build Coastguard Worker    ld.param.u64    %rl1, [kernel_param_0];
512*9880d681SAndroid Build Coastguard Worker    mov.u32         %r1, %tid.x;
513*9880d681SAndroid Build Coastguard Worker    mul.wide.s32    %rl2, %r1, 4;
514*9880d681SAndroid Build Coastguard Worker    add.s64         %rl3, %rl1, %rl2;
515*9880d681SAndroid Build Coastguard Worker    ld.param.u64    %rl4, [kernel_param_1];
516*9880d681SAndroid Build Coastguard Worker    add.s64         %rl5, %rl4, %rl2;
517*9880d681SAndroid Build Coastguard Worker    ld.param.u64    %rl6, [kernel_param_2];
518*9880d681SAndroid Build Coastguard Worker    add.s64         %rl7, %rl6, %rl2;
519*9880d681SAndroid Build Coastguard Worker    ld.global.f32   %f1, [%rl3];
520*9880d681SAndroid Build Coastguard Worker    ld.global.f32   %f2, [%rl5];
521*9880d681SAndroid Build Coastguard Worker    add.f32         %f3, %f1, %f2;
522*9880d681SAndroid Build Coastguard Worker    st.global.f32   [%rl7], %f3;
523*9880d681SAndroid Build Coastguard Worker    ret;
524*9880d681SAndroid Build Coastguard Worker  }
525*9880d681SAndroid Build Coastguard Worker
526*9880d681SAndroid Build Coastguard Worker
527*9880d681SAndroid Build Coastguard WorkerDissecting the Kernel
528*9880d681SAndroid Build Coastguard Worker---------------------
529*9880d681SAndroid Build Coastguard Worker
530*9880d681SAndroid Build Coastguard WorkerNow let us dissect the LLVM IR that makes up this kernel.
531*9880d681SAndroid Build Coastguard Worker
532*9880d681SAndroid Build Coastguard WorkerData Layout
533*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^
534*9880d681SAndroid Build Coastguard Worker
535*9880d681SAndroid Build Coastguard WorkerThe data layout string determines the size in bits of common data types, their
536*9880d681SAndroid Build Coastguard WorkerABI alignment, and their storage size.  For NVPTX, you should use one of the
537*9880d681SAndroid Build Coastguard Workerfollowing:
538*9880d681SAndroid Build Coastguard Worker
539*9880d681SAndroid Build Coastguard Worker32-bit PTX:
540*9880d681SAndroid Build Coastguard Worker
541*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
542*9880d681SAndroid Build Coastguard Worker
543*9880d681SAndroid Build Coastguard Worker  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
544*9880d681SAndroid Build Coastguard Worker
545*9880d681SAndroid Build Coastguard Worker64-bit PTX:
546*9880d681SAndroid Build Coastguard Worker
547*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
548*9880d681SAndroid Build Coastguard Worker
549*9880d681SAndroid Build Coastguard Worker  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
550*9880d681SAndroid Build Coastguard Worker
551*9880d681SAndroid Build Coastguard Worker
552*9880d681SAndroid Build Coastguard WorkerTarget Intrinsics
553*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^
554*9880d681SAndroid Build Coastguard Worker
555*9880d681SAndroid Build Coastguard WorkerIn this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
556*9880d681SAndroid Build Coastguard Workerread the X component of the current thread's ID, which corresponds to a read
557*9880d681SAndroid Build Coastguard Workerof register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
558*9880d681SAndroid Build Coastguard Workerintrinsics.  A short list is shown below; please see
559*9880d681SAndroid Build Coastguard Worker``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
560*9880d681SAndroid Build Coastguard Worker
561*9880d681SAndroid Build Coastguard Worker
562*9880d681SAndroid Build Coastguard Worker================================================ ====================
563*9880d681SAndroid Build Coastguard WorkerIntrinsic                                        CUDA Equivalent
564*9880d681SAndroid Build Coastguard Worker================================================ ====================
565*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
566*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
567*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
568*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
569*9880d681SAndroid Build Coastguard Worker``void @llvm.nvvm.barrier0()``                   __syncthreads()
570*9880d681SAndroid Build Coastguard Worker================================================ ====================
571*9880d681SAndroid Build Coastguard Worker
572*9880d681SAndroid Build Coastguard Worker
573*9880d681SAndroid Build Coastguard WorkerAddress Spaces
574*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^
575*9880d681SAndroid Build Coastguard Worker
576*9880d681SAndroid Build Coastguard WorkerYou may have noticed that all of the pointer types in the LLVM IR example had
577*9880d681SAndroid Build Coastguard Workeran explicit address space specifier. What is address space 1? NVIDIA GPU
578*9880d681SAndroid Build Coastguard Workerdevices (generally) have four types of memory:
579*9880d681SAndroid Build Coastguard Worker
580*9880d681SAndroid Build Coastguard Worker- Global: Large, off-chip memory
581*9880d681SAndroid Build Coastguard Worker- Shared: Small, on-chip memory shared among all threads in a CTA
582*9880d681SAndroid Build Coastguard Worker- Local: Per-thread, private memory
583*9880d681SAndroid Build Coastguard Worker- Constant: Read-only memory shared across all threads
584*9880d681SAndroid Build Coastguard Worker
585*9880d681SAndroid Build Coastguard WorkerThese different types of memory are represented in LLVM IR as address spaces.
586*9880d681SAndroid Build Coastguard WorkerThere is also a fifth address space used by the NVPTX code generator that
587*9880d681SAndroid Build Coastguard Workercorresponds to the "generic" address space.  This address space can represent
588*9880d681SAndroid Build Coastguard Workeraddresses in any other address space (with a few exceptions).  This allows
589*9880d681SAndroid Build Coastguard Workerusers to write IR functions that can load/store memory using the same
590*9880d681SAndroid Build Coastguard Workerinstructions. Intrinsics are provided to convert pointers between the generic
591*9880d681SAndroid Build Coastguard Workerand non-generic address spaces.
592*9880d681SAndroid Build Coastguard Worker
593*9880d681SAndroid Build Coastguard WorkerSee :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
594*9880d681SAndroid Build Coastguard Worker
595*9880d681SAndroid Build Coastguard Worker
596*9880d681SAndroid Build Coastguard WorkerKernel Metadata
597*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^
598*9880d681SAndroid Build Coastguard Worker
599*9880d681SAndroid Build Coastguard WorkerIn PTX, a function can be either a `kernel` function (callable from the host
600*9880d681SAndroid Build Coastguard Workerprogram), or a `device` function (callable only from GPU code). You can think
601*9880d681SAndroid Build Coastguard Workerof `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
602*9880d681SAndroid Build Coastguard Workerfunction as a `kernel` function, we make use of special LLVM metadata. The
603*9880d681SAndroid Build Coastguard WorkerNVPTX back-end will look for a named metadata node called
604*9880d681SAndroid Build Coastguard Worker``nvvm.annotations``. This named metadata must contain a list of metadata that
605*9880d681SAndroid Build Coastguard Workerdescribe the IR. For our purposes, we need to declare a metadata node that
606*9880d681SAndroid Build Coastguard Workerassigns the "kernel" attribute to the LLVM IR function that should be emitted
607*9880d681SAndroid Build Coastguard Workeras a PTX `kernel` function. These metadata nodes take the form:
608*9880d681SAndroid Build Coastguard Worker
609*9880d681SAndroid Build Coastguard Worker.. code-block:: text
610*9880d681SAndroid Build Coastguard Worker
611*9880d681SAndroid Build Coastguard Worker  !{<function ref>, metadata !"kernel", i32 1}
612*9880d681SAndroid Build Coastguard Worker
613*9880d681SAndroid Build Coastguard WorkerFor the previous example, we have:
614*9880d681SAndroid Build Coastguard Worker
615*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
616*9880d681SAndroid Build Coastguard Worker
617*9880d681SAndroid Build Coastguard Worker  !nvvm.annotations = !{!0}
618*9880d681SAndroid Build Coastguard Worker  !0 = !{void (float addrspace(1)*,
619*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*,
620*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*)* @kernel, !"kernel", i32 1}
621*9880d681SAndroid Build Coastguard Worker
622*9880d681SAndroid Build Coastguard WorkerHere, we have a single metadata declaration in ``nvvm.annotations``. This
623*9880d681SAndroid Build Coastguard Workermetadata annotates our ``@kernel`` function with the ``kernel`` attribute.
624*9880d681SAndroid Build Coastguard Worker
625*9880d681SAndroid Build Coastguard Worker
626*9880d681SAndroid Build Coastguard WorkerRunning the Kernel
627*9880d681SAndroid Build Coastguard Worker------------------
628*9880d681SAndroid Build Coastguard Worker
629*9880d681SAndroid Build Coastguard WorkerGenerating PTX from LLVM IR is all well and good, but how do we execute it on
630*9880d681SAndroid Build Coastguard Workera real GPU device? The CUDA Driver API provides a convenient mechanism for
631*9880d681SAndroid Build Coastguard Workerloading and JIT compiling PTX to a native GPU device, and launching a kernel.
632*9880d681SAndroid Build Coastguard WorkerThe API is similar to OpenCL.  A simple example showing how to load and
633*9880d681SAndroid Build Coastguard Workerexecute our vector addition code is shown below. Note that for brevity this
634*9880d681SAndroid Build Coastguard Workercode does not perform much error checking!
635*9880d681SAndroid Build Coastguard Worker
636*9880d681SAndroid Build Coastguard Worker.. note::
637*9880d681SAndroid Build Coastguard Worker
638*9880d681SAndroid Build Coastguard Worker  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
639*9880d681SAndroid Build Coastguard Worker  compile PTX to machine code (SASS) for a specific GPU architecture. Such
640*9880d681SAndroid Build Coastguard Worker  binaries can be loaded by the CUDA Driver API in the same way as PTX. This
641*9880d681SAndroid Build Coastguard Worker  can be useful for reducing startup time by precompiling the PTX kernels.
642*9880d681SAndroid Build Coastguard Worker
643*9880d681SAndroid Build Coastguard Worker
644*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
645*9880d681SAndroid Build Coastguard Worker
646*9880d681SAndroid Build Coastguard Worker  #include <iostream>
647*9880d681SAndroid Build Coastguard Worker  #include <fstream>
648*9880d681SAndroid Build Coastguard Worker  #include <cassert>
649*9880d681SAndroid Build Coastguard Worker  #include "cuda.h"
650*9880d681SAndroid Build Coastguard Worker
651*9880d681SAndroid Build Coastguard Worker
652*9880d681SAndroid Build Coastguard Worker  void checkCudaErrors(CUresult err) {
653*9880d681SAndroid Build Coastguard Worker    assert(err == CUDA_SUCCESS);
654*9880d681SAndroid Build Coastguard Worker  }
655*9880d681SAndroid Build Coastguard Worker
656*9880d681SAndroid Build Coastguard Worker  /// main - Program entry point
657*9880d681SAndroid Build Coastguard Worker  int main(int argc, char **argv) {
658*9880d681SAndroid Build Coastguard Worker    CUdevice    device;
659*9880d681SAndroid Build Coastguard Worker    CUmodule    cudaModule;
660*9880d681SAndroid Build Coastguard Worker    CUcontext   context;
661*9880d681SAndroid Build Coastguard Worker    CUfunction  function;
662*9880d681SAndroid Build Coastguard Worker    CUlinkState linker;
663*9880d681SAndroid Build Coastguard Worker    int         devCount;
664*9880d681SAndroid Build Coastguard Worker
665*9880d681SAndroid Build Coastguard Worker    // CUDA initialization
666*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuInit(0));
667*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuDeviceGetCount(&devCount));
668*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuDeviceGet(&device, 0));
669*9880d681SAndroid Build Coastguard Worker
670*9880d681SAndroid Build Coastguard Worker    char name[128];
671*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuDeviceGetName(name, 128, device));
672*9880d681SAndroid Build Coastguard Worker    std::cout << "Using CUDA Device [0]: " << name << "\n";
673*9880d681SAndroid Build Coastguard Worker
674*9880d681SAndroid Build Coastguard Worker    int devMajor, devMinor;
675*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
676*9880d681SAndroid Build Coastguard Worker    std::cout << "Device Compute Capability: "
677*9880d681SAndroid Build Coastguard Worker              << devMajor << "." << devMinor << "\n";
678*9880d681SAndroid Build Coastguard Worker    if (devMajor < 2) {
679*9880d681SAndroid Build Coastguard Worker      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
680*9880d681SAndroid Build Coastguard Worker      return 1;
681*9880d681SAndroid Build Coastguard Worker    }
682*9880d681SAndroid Build Coastguard Worker
683*9880d681SAndroid Build Coastguard Worker    std::ifstream t("kernel.ptx");
684*9880d681SAndroid Build Coastguard Worker    if (!t.is_open()) {
685*9880d681SAndroid Build Coastguard Worker      std::cerr << "kernel.ptx not found\n";
686*9880d681SAndroid Build Coastguard Worker      return 1;
687*9880d681SAndroid Build Coastguard Worker    }
688*9880d681SAndroid Build Coastguard Worker    std::string str((std::istreambuf_iterator<char>(t)),
689*9880d681SAndroid Build Coastguard Worker                      std::istreambuf_iterator<char>());
690*9880d681SAndroid Build Coastguard Worker
691*9880d681SAndroid Build Coastguard Worker    // Create driver context
692*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuCtxCreate(&context, 0, device));
693*9880d681SAndroid Build Coastguard Worker
694*9880d681SAndroid Build Coastguard Worker    // Create module for object
695*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
696*9880d681SAndroid Build Coastguard Worker
697*9880d681SAndroid Build Coastguard Worker    // Get kernel function
698*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
699*9880d681SAndroid Build Coastguard Worker
700*9880d681SAndroid Build Coastguard Worker    // Device data
701*9880d681SAndroid Build Coastguard Worker    CUdeviceptr devBufferA;
702*9880d681SAndroid Build Coastguard Worker    CUdeviceptr devBufferB;
703*9880d681SAndroid Build Coastguard Worker    CUdeviceptr devBufferC;
704*9880d681SAndroid Build Coastguard Worker
705*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
706*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
707*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
708*9880d681SAndroid Build Coastguard Worker
709*9880d681SAndroid Build Coastguard Worker    float* hostA = new float[16];
710*9880d681SAndroid Build Coastguard Worker    float* hostB = new float[16];
711*9880d681SAndroid Build Coastguard Worker    float* hostC = new float[16];
712*9880d681SAndroid Build Coastguard Worker
713*9880d681SAndroid Build Coastguard Worker    // Populate input
714*9880d681SAndroid Build Coastguard Worker    for (unsigned i = 0; i != 16; ++i) {
715*9880d681SAndroid Build Coastguard Worker      hostA[i] = (float)i;
716*9880d681SAndroid Build Coastguard Worker      hostB[i] = (float)(2*i);
717*9880d681SAndroid Build Coastguard Worker      hostC[i] = 0.0f;
718*9880d681SAndroid Build Coastguard Worker    }
719*9880d681SAndroid Build Coastguard Worker
720*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
721*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
722*9880d681SAndroid Build Coastguard Worker
723*9880d681SAndroid Build Coastguard Worker
724*9880d681SAndroid Build Coastguard Worker    unsigned blockSizeX = 16;
725*9880d681SAndroid Build Coastguard Worker    unsigned blockSizeY = 1;
726*9880d681SAndroid Build Coastguard Worker    unsigned blockSizeZ = 1;
727*9880d681SAndroid Build Coastguard Worker    unsigned gridSizeX  = 1;
728*9880d681SAndroid Build Coastguard Worker    unsigned gridSizeY  = 1;
729*9880d681SAndroid Build Coastguard Worker    unsigned gridSizeZ  = 1;
730*9880d681SAndroid Build Coastguard Worker
731*9880d681SAndroid Build Coastguard Worker    // Kernel parameters
732*9880d681SAndroid Build Coastguard Worker    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
733*9880d681SAndroid Build Coastguard Worker
734*9880d681SAndroid Build Coastguard Worker    std::cout << "Launching kernel\n";
735*9880d681SAndroid Build Coastguard Worker
736*9880d681SAndroid Build Coastguard Worker    // Kernel launch
737*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
738*9880d681SAndroid Build Coastguard Worker                                   blockSizeX, blockSizeY, blockSizeZ,
739*9880d681SAndroid Build Coastguard Worker                                   0, NULL, KernelParams, NULL));
740*9880d681SAndroid Build Coastguard Worker
741*9880d681SAndroid Build Coastguard Worker    // Retrieve device data
742*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
743*9880d681SAndroid Build Coastguard Worker
744*9880d681SAndroid Build Coastguard Worker
745*9880d681SAndroid Build Coastguard Worker    std::cout << "Results:\n";
746*9880d681SAndroid Build Coastguard Worker    for (unsigned i = 0; i != 16; ++i) {
747*9880d681SAndroid Build Coastguard Worker      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
748*9880d681SAndroid Build Coastguard Worker    }
749*9880d681SAndroid Build Coastguard Worker
750*9880d681SAndroid Build Coastguard Worker
751*9880d681SAndroid Build Coastguard Worker    // Clean up after ourselves
752*9880d681SAndroid Build Coastguard Worker    delete [] hostA;
753*9880d681SAndroid Build Coastguard Worker    delete [] hostB;
754*9880d681SAndroid Build Coastguard Worker    delete [] hostC;
755*9880d681SAndroid Build Coastguard Worker
756*9880d681SAndroid Build Coastguard Worker    // Clean-up
757*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemFree(devBufferA));
758*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemFree(devBufferB));
759*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuMemFree(devBufferC));
760*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuModuleUnload(cudaModule));
761*9880d681SAndroid Build Coastguard Worker    checkCudaErrors(cuCtxDestroy(context));
762*9880d681SAndroid Build Coastguard Worker
763*9880d681SAndroid Build Coastguard Worker    return 0;
764*9880d681SAndroid Build Coastguard Worker  }
765*9880d681SAndroid Build Coastguard Worker
766*9880d681SAndroid Build Coastguard Worker
767*9880d681SAndroid Build Coastguard WorkerYou will need to link with the CUDA driver and specify the path to cuda.h.
768*9880d681SAndroid Build Coastguard Worker
769*9880d681SAndroid Build Coastguard Worker.. code-block:: text
770*9880d681SAndroid Build Coastguard Worker
771*9880d681SAndroid Build Coastguard Worker  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
772*9880d681SAndroid Build Coastguard Worker
773*9880d681SAndroid Build Coastguard WorkerWe don't need to specify a path to ``libcuda.so`` since this is installed in a
774*9880d681SAndroid Build Coastguard Workersystem location by the driver, not the CUDA toolkit.
775*9880d681SAndroid Build Coastguard Worker
776*9880d681SAndroid Build Coastguard WorkerIf everything goes as planned, you should see the following output when
777*9880d681SAndroid Build Coastguard Workerrunning the compiled program:
778*9880d681SAndroid Build Coastguard Worker
779*9880d681SAndroid Build Coastguard Worker.. code-block:: text
780*9880d681SAndroid Build Coastguard Worker
781*9880d681SAndroid Build Coastguard Worker  Using CUDA Device [0]: GeForce GTX 680
782*9880d681SAndroid Build Coastguard Worker  Device Compute Capability: 3.0
783*9880d681SAndroid Build Coastguard Worker  Launching kernel
784*9880d681SAndroid Build Coastguard Worker  Results:
785*9880d681SAndroid Build Coastguard Worker  0 + 0 = 0
786*9880d681SAndroid Build Coastguard Worker  1 + 2 = 3
787*9880d681SAndroid Build Coastguard Worker  2 + 4 = 6
788*9880d681SAndroid Build Coastguard Worker  3 + 6 = 9
789*9880d681SAndroid Build Coastguard Worker  4 + 8 = 12
790*9880d681SAndroid Build Coastguard Worker  5 + 10 = 15
791*9880d681SAndroid Build Coastguard Worker  6 + 12 = 18
792*9880d681SAndroid Build Coastguard Worker  7 + 14 = 21
793*9880d681SAndroid Build Coastguard Worker  8 + 16 = 24
794*9880d681SAndroid Build Coastguard Worker  9 + 18 = 27
795*9880d681SAndroid Build Coastguard Worker  10 + 20 = 30
796*9880d681SAndroid Build Coastguard Worker  11 + 22 = 33
797*9880d681SAndroid Build Coastguard Worker  12 + 24 = 36
798*9880d681SAndroid Build Coastguard Worker  13 + 26 = 39
799*9880d681SAndroid Build Coastguard Worker  14 + 28 = 42
800*9880d681SAndroid Build Coastguard Worker  15 + 30 = 45
801*9880d681SAndroid Build Coastguard Worker
802*9880d681SAndroid Build Coastguard Worker.. note::
803*9880d681SAndroid Build Coastguard Worker
804*9880d681SAndroid Build Coastguard Worker  You will likely see a different device identifier based on your hardware
805*9880d681SAndroid Build Coastguard Worker
806*9880d681SAndroid Build Coastguard Worker
807*9880d681SAndroid Build Coastguard WorkerTutorial: Linking with Libdevice
808*9880d681SAndroid Build Coastguard Worker================================
809*9880d681SAndroid Build Coastguard Worker
810*9880d681SAndroid Build Coastguard WorkerIn this tutorial, we show a simple example of linking LLVM IR with the
811*9880d681SAndroid Build Coastguard Workerlibdevice library. We will use the same kernel as the previous tutorial,
812*9880d681SAndroid Build Coastguard Workerexcept that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
813*9880d681SAndroid Build Coastguard WorkerLibdevice provides an ``__nv_powf`` function that we will use.
814*9880d681SAndroid Build Coastguard Worker
815*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm
816*9880d681SAndroid Build Coastguard Worker
817*9880d681SAndroid Build Coastguard Worker  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
818*9880d681SAndroid Build Coastguard Worker  target triple = "nvptx64-nvidia-cuda"
819*9880d681SAndroid Build Coastguard Worker
820*9880d681SAndroid Build Coastguard Worker  ; Intrinsic to read X component of thread ID
821*9880d681SAndroid Build Coastguard Worker  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
822*9880d681SAndroid Build Coastguard Worker  ; libdevice function
823*9880d681SAndroid Build Coastguard Worker  declare float @__nv_powf(float, float)
824*9880d681SAndroid Build Coastguard Worker
825*9880d681SAndroid Build Coastguard Worker  define void @kernel(float addrspace(1)* %A,
826*9880d681SAndroid Build Coastguard Worker                      float addrspace(1)* %B,
827*9880d681SAndroid Build Coastguard Worker                      float addrspace(1)* %C) {
828*9880d681SAndroid Build Coastguard Worker  entry:
829*9880d681SAndroid Build Coastguard Worker    ; What is my ID?
830*9880d681SAndroid Build Coastguard Worker    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
831*9880d681SAndroid Build Coastguard Worker
832*9880d681SAndroid Build Coastguard Worker    ; Compute pointers into A, B, and C
833*9880d681SAndroid Build Coastguard Worker    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
834*9880d681SAndroid Build Coastguard Worker    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
835*9880d681SAndroid Build Coastguard Worker    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
836*9880d681SAndroid Build Coastguard Worker
837*9880d681SAndroid Build Coastguard Worker    ; Read A, B
838*9880d681SAndroid Build Coastguard Worker    %valA = load float, float addrspace(1)* %ptrA, align 4
839*9880d681SAndroid Build Coastguard Worker    %valB = load float, float addrspace(1)* %ptrB, align 4
840*9880d681SAndroid Build Coastguard Worker
841*9880d681SAndroid Build Coastguard Worker    ; Compute C = pow(A, B)
842*9880d681SAndroid Build Coastguard Worker    %valC = call float @__nv_powf(float %valA, float %valB)
843*9880d681SAndroid Build Coastguard Worker
844*9880d681SAndroid Build Coastguard Worker    ; Store back to C
845*9880d681SAndroid Build Coastguard Worker    store float %valC, float addrspace(1)* %ptrC, align 4
846*9880d681SAndroid Build Coastguard Worker
847*9880d681SAndroid Build Coastguard Worker    ret void
848*9880d681SAndroid Build Coastguard Worker  }
849*9880d681SAndroid Build Coastguard Worker
850*9880d681SAndroid Build Coastguard Worker  !nvvm.annotations = !{!0}
851*9880d681SAndroid Build Coastguard Worker  !0 = !{void (float addrspace(1)*,
852*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*,
853*9880d681SAndroid Build Coastguard Worker               float addrspace(1)*)* @kernel, !"kernel", i32 1}
854*9880d681SAndroid Build Coastguard Worker
855*9880d681SAndroid Build Coastguard Worker
856*9880d681SAndroid Build Coastguard WorkerTo compile this kernel, we perform the following steps:
857*9880d681SAndroid Build Coastguard Worker
858*9880d681SAndroid Build Coastguard Worker1. Link with libdevice
859*9880d681SAndroid Build Coastguard Worker2. Internalize all but the public kernel function
860*9880d681SAndroid Build Coastguard Worker3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
861*9880d681SAndroid Build Coastguard Worker4. Optimize the linked module
862*9880d681SAndroid Build Coastguard Worker5. Codegen the module
863*9880d681SAndroid Build Coastguard Worker
864*9880d681SAndroid Build Coastguard Worker
865*9880d681SAndroid Build Coastguard WorkerThese steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
866*9880d681SAndroid Build Coastguard Workertools. In a complete compiler, these steps can also be performed entirely
867*9880d681SAndroid Build Coastguard Workerprogrammatically by setting up an appropriate pass configuration (see
868*9880d681SAndroid Build Coastguard Worker:ref:`libdevice`).
869*9880d681SAndroid Build Coastguard Worker
870*9880d681SAndroid Build Coastguard Worker.. code-block:: text
871*9880d681SAndroid Build Coastguard Worker
872*9880d681SAndroid Build Coastguard Worker  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
873*9880d681SAndroid Build Coastguard Worker  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
874*9880d681SAndroid Build Coastguard Worker  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
875*9880d681SAndroid Build Coastguard Worker
876*9880d681SAndroid Build Coastguard Worker.. note::
877*9880d681SAndroid Build Coastguard Worker
878*9880d681SAndroid Build Coastguard Worker  The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any
879*9880d681SAndroid Build Coastguard Worker  undefined variables will default to zero. It is shown here for evaluation
880*9880d681SAndroid Build Coastguard Worker  purposes.
881*9880d681SAndroid Build Coastguard Worker
882*9880d681SAndroid Build Coastguard Worker
883*9880d681SAndroid Build Coastguard WorkerThis gives us the following PTX (excerpt):
884*9880d681SAndroid Build Coastguard Worker
885*9880d681SAndroid Build Coastguard Worker.. code-block:: text
886*9880d681SAndroid Build Coastguard Worker
887*9880d681SAndroid Build Coastguard Worker  //
888*9880d681SAndroid Build Coastguard Worker  // Generated by LLVM NVPTX Back-End
889*9880d681SAndroid Build Coastguard Worker  //
890*9880d681SAndroid Build Coastguard Worker
891*9880d681SAndroid Build Coastguard Worker  .version 3.1
892*9880d681SAndroid Build Coastguard Worker  .target sm_20
893*9880d681SAndroid Build Coastguard Worker  .address_size 64
894*9880d681SAndroid Build Coastguard Worker
895*9880d681SAndroid Build Coastguard Worker    // .globl kernel
896*9880d681SAndroid Build Coastguard Worker                                          // @kernel
897*9880d681SAndroid Build Coastguard Worker  .visible .entry kernel(
898*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_0,
899*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_1,
900*9880d681SAndroid Build Coastguard Worker    .param .u64 kernel_param_2
901*9880d681SAndroid Build Coastguard Worker  )
902*9880d681SAndroid Build Coastguard Worker  {
903*9880d681SAndroid Build Coastguard Worker    .reg .pred  %p<30>;
904*9880d681SAndroid Build Coastguard Worker    .reg .f32   %f<111>;
905*9880d681SAndroid Build Coastguard Worker    .reg .s32   %r<21>;
906*9880d681SAndroid Build Coastguard Worker    .reg .s64   %rl<8>;
907*9880d681SAndroid Build Coastguard Worker
908*9880d681SAndroid Build Coastguard Worker  // BB#0:                                // %entry
909*9880d681SAndroid Build Coastguard Worker    ld.param.u64  %rl2, [kernel_param_0];
910*9880d681SAndroid Build Coastguard Worker    mov.u32   %r3, %tid.x;
911*9880d681SAndroid Build Coastguard Worker    ld.param.u64  %rl3, [kernel_param_1];
912*9880d681SAndroid Build Coastguard Worker    mul.wide.s32  %rl4, %r3, 4;
913*9880d681SAndroid Build Coastguard Worker    add.s64   %rl5, %rl2, %rl4;
914*9880d681SAndroid Build Coastguard Worker    ld.param.u64  %rl6, [kernel_param_2];
915*9880d681SAndroid Build Coastguard Worker    add.s64   %rl7, %rl3, %rl4;
916*9880d681SAndroid Build Coastguard Worker    add.s64   %rl1, %rl6, %rl4;
917*9880d681SAndroid Build Coastguard Worker    ld.global.f32   %f1, [%rl5];
918*9880d681SAndroid Build Coastguard Worker    ld.global.f32   %f2, [%rl7];
919*9880d681SAndroid Build Coastguard Worker    setp.eq.f32 %p1, %f1, 0f3F800000;
920*9880d681SAndroid Build Coastguard Worker    setp.eq.f32 %p2, %f2, 0f00000000;
921*9880d681SAndroid Build Coastguard Worker    or.pred   %p3, %p1, %p2;
922*9880d681SAndroid Build Coastguard Worker    @%p3 bra  BB0_1;
923*9880d681SAndroid Build Coastguard Worker    bra.uni   BB0_2;
924*9880d681SAndroid Build Coastguard Worker  BB0_1:
925*9880d681SAndroid Build Coastguard Worker    mov.f32   %f110, 0f3F800000;
926*9880d681SAndroid Build Coastguard Worker    st.global.f32   [%rl1], %f110;
927*9880d681SAndroid Build Coastguard Worker    ret;
928*9880d681SAndroid Build Coastguard Worker  BB0_2:                                  // %__nv_isnanf.exit.i
929*9880d681SAndroid Build Coastguard Worker    abs.f32   %f4, %f1;
930*9880d681SAndroid Build Coastguard Worker    setp.gtu.f32  %p4, %f4, 0f7F800000;
931*9880d681SAndroid Build Coastguard Worker    @%p4 bra  BB0_4;
932*9880d681SAndroid Build Coastguard Worker  // BB#3:                                // %__nv_isnanf.exit5.i
933*9880d681SAndroid Build Coastguard Worker    abs.f32   %f5, %f2;
934*9880d681SAndroid Build Coastguard Worker    setp.le.f32 %p5, %f5, 0f7F800000;
935*9880d681SAndroid Build Coastguard Worker    @%p5 bra  BB0_5;
936*9880d681SAndroid Build Coastguard Worker  BB0_4:                                  // %.critedge1.i
937*9880d681SAndroid Build Coastguard Worker    add.f32   %f110, %f1, %f2;
938*9880d681SAndroid Build Coastguard Worker    st.global.f32   [%rl1], %f110;
939*9880d681SAndroid Build Coastguard Worker    ret;
940*9880d681SAndroid Build Coastguard Worker  BB0_5:                                  // %__nv_isinff.exit.i
941*9880d681SAndroid Build Coastguard Worker
942*9880d681SAndroid Build Coastguard Worker    ...
943*9880d681SAndroid Build Coastguard Worker
944*9880d681SAndroid Build Coastguard Worker  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
945*9880d681SAndroid Build Coastguard Worker    mul.f32   %f90, %f107, 0f3FB8AA3B;
946*9880d681SAndroid Build Coastguard Worker    cvt.rzi.f32.f32 %f91, %f90;
947*9880d681SAndroid Build Coastguard Worker    mov.f32   %f92, 0fBF317200;
948*9880d681SAndroid Build Coastguard Worker    fma.rn.f32  %f93, %f91, %f92, %f107;
949*9880d681SAndroid Build Coastguard Worker    mov.f32   %f94, 0fB5BFBE8E;
950*9880d681SAndroid Build Coastguard Worker    fma.rn.f32  %f95, %f91, %f94, %f93;
951*9880d681SAndroid Build Coastguard Worker    mul.f32   %f89, %f95, 0f3FB8AA3B;
952*9880d681SAndroid Build Coastguard Worker    // inline asm
953*9880d681SAndroid Build Coastguard Worker    ex2.approx.ftz.f32 %f88,%f89;
954*9880d681SAndroid Build Coastguard Worker    // inline asm
955*9880d681SAndroid Build Coastguard Worker    add.f32   %f96, %f91, 0f00000000;
956*9880d681SAndroid Build Coastguard Worker    ex2.approx.f32  %f97, %f96;
957*9880d681SAndroid Build Coastguard Worker    mul.f32   %f98, %f88, %f97;
958*9880d681SAndroid Build Coastguard Worker    setp.lt.f32 %p15, %f107, 0fC2D20000;
959*9880d681SAndroid Build Coastguard Worker    selp.f32  %f99, 0f00000000, %f98, %p15;
960*9880d681SAndroid Build Coastguard Worker    setp.gt.f32 %p16, %f107, 0f42D20000;
961*9880d681SAndroid Build Coastguard Worker    selp.f32  %f110, 0f7F800000, %f99, %p16;
962*9880d681SAndroid Build Coastguard Worker    setp.eq.f32 %p17, %f110, 0f7F800000;
963*9880d681SAndroid Build Coastguard Worker    @%p17 bra   BB0_28;
964*9880d681SAndroid Build Coastguard Worker  // BB#27:
965*9880d681SAndroid Build Coastguard Worker    fma.rn.f32  %f110, %f110, %f108, %f110;
966*9880d681SAndroid Build Coastguard Worker  BB0_28:                                 // %__internal_accurate_powf.exit.i
967*9880d681SAndroid Build Coastguard Worker    setp.lt.f32 %p18, %f1, 0f00000000;
968*9880d681SAndroid Build Coastguard Worker    setp.eq.f32 %p19, %f3, 0f3F800000;
969*9880d681SAndroid Build Coastguard Worker    and.pred    %p20, %p18, %p19;
970*9880d681SAndroid Build Coastguard Worker    @!%p20 bra  BB0_30;
971*9880d681SAndroid Build Coastguard Worker    bra.uni   BB0_29;
972*9880d681SAndroid Build Coastguard Worker  BB0_29:
973*9880d681SAndroid Build Coastguard Worker    mov.b32    %r9, %f110;
974*9880d681SAndroid Build Coastguard Worker    xor.b32   %r10, %r9, -2147483648;
975*9880d681SAndroid Build Coastguard Worker    mov.b32    %f110, %r10;
976*9880d681SAndroid Build Coastguard Worker  BB0_30:                                 // %__nv_powf.exit
977*9880d681SAndroid Build Coastguard Worker    st.global.f32   [%rl1], %f110;
978*9880d681SAndroid Build Coastguard Worker    ret;
979*9880d681SAndroid Build Coastguard Worker  }
980*9880d681SAndroid Build Coastguard Worker
981