1*9880d681SAndroid Build Coastguard Worker============================= 2*9880d681SAndroid Build Coastguard WorkerUser Guide for NVPTX Back-end 3*9880d681SAndroid Build Coastguard Worker============================= 4*9880d681SAndroid Build Coastguard Worker 5*9880d681SAndroid Build Coastguard Worker.. contents:: 6*9880d681SAndroid Build Coastguard Worker :local: 7*9880d681SAndroid Build Coastguard Worker :depth: 3 8*9880d681SAndroid Build Coastguard Worker 9*9880d681SAndroid Build Coastguard Worker 10*9880d681SAndroid Build Coastguard WorkerIntroduction 11*9880d681SAndroid Build Coastguard Worker============ 12*9880d681SAndroid Build Coastguard Worker 13*9880d681SAndroid Build Coastguard WorkerTo support GPU programming, the NVPTX back-end supports a subset of LLVM IR 14*9880d681SAndroid Build Coastguard Workeralong with a defined set of conventions used to represent GPU programming 15*9880d681SAndroid Build Coastguard Workerconcepts. This document provides an overview of the general usage of the back- 16*9880d681SAndroid Build Coastguard Workerend, including a description of the conventions used and the set of accepted 17*9880d681SAndroid Build Coastguard WorkerLLVM IR. 18*9880d681SAndroid Build Coastguard Worker 19*9880d681SAndroid Build Coastguard Worker.. note:: 20*9880d681SAndroid Build Coastguard Worker 21*9880d681SAndroid Build Coastguard Worker This document assumes a basic familiarity with CUDA and the PTX 22*9880d681SAndroid Build Coastguard Worker assembly language. Information about the CUDA Driver API and the PTX assembly 23*9880d681SAndroid Build Coastguard Worker language can be found in the `CUDA documentation 24*9880d681SAndroid Build Coastguard Worker <http://docs.nvidia.com/cuda/index.html>`_. 25*9880d681SAndroid Build Coastguard Worker 26*9880d681SAndroid Build Coastguard Worker 27*9880d681SAndroid Build Coastguard Worker 28*9880d681SAndroid Build Coastguard WorkerConventions 29*9880d681SAndroid Build Coastguard Worker=========== 30*9880d681SAndroid Build Coastguard Worker 31*9880d681SAndroid Build Coastguard WorkerMarking Functions as Kernels 32*9880d681SAndroid Build Coastguard Worker---------------------------- 33*9880d681SAndroid Build Coastguard Worker 34*9880d681SAndroid Build Coastguard WorkerIn PTX, there are two types of functions: *device functions*, which are only 35*9880d681SAndroid Build Coastguard Workercallable by device code, and *kernel functions*, which are callable by host 36*9880d681SAndroid Build Coastguard Workercode. By default, the back-end will emit device functions. Metadata is used to 37*9880d681SAndroid Build Coastguard Workerdeclare a function as a kernel function. This metadata is attached to the 38*9880d681SAndroid Build Coastguard Worker``nvvm.annotations`` named metadata object, and has the following format: 39*9880d681SAndroid Build Coastguard Worker 40*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 41*9880d681SAndroid Build Coastguard Worker 42*9880d681SAndroid Build Coastguard Worker !0 = !{<function-ref>, metadata !"kernel", i32 1} 43*9880d681SAndroid Build Coastguard Worker 44*9880d681SAndroid Build Coastguard WorkerThe first parameter is a reference to the kernel function. The following 45*9880d681SAndroid Build Coastguard Workerexample shows a kernel function calling a device function in LLVM IR. The 46*9880d681SAndroid Build Coastguard Workerfunction ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not. 47*9880d681SAndroid Build Coastguard Worker 48*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 49*9880d681SAndroid Build Coastguard Worker 50*9880d681SAndroid Build Coastguard Worker define float @my_fmad(float %x, float %y, float %z) { 51*9880d681SAndroid Build Coastguard Worker %mul = fmul float %x, %y 52*9880d681SAndroid Build Coastguard Worker %add = fadd float %mul, %z 53*9880d681SAndroid Build Coastguard Worker ret float %add 54*9880d681SAndroid Build Coastguard Worker } 55*9880d681SAndroid Build Coastguard Worker 56*9880d681SAndroid Build Coastguard Worker define void @my_kernel(float* %ptr) { 57*9880d681SAndroid Build Coastguard Worker %val = load float, float* %ptr 58*9880d681SAndroid Build Coastguard Worker %ret = call float @my_fmad(float %val, float %val, float %val) 59*9880d681SAndroid Build Coastguard Worker store float %ret, float* %ptr 60*9880d681SAndroid Build Coastguard Worker ret void 61*9880d681SAndroid Build Coastguard Worker } 62*9880d681SAndroid Build Coastguard Worker 63*9880d681SAndroid Build Coastguard Worker !nvvm.annotations = !{!1} 64*9880d681SAndroid Build Coastguard Worker !1 = !{void (float*)* @my_kernel, !"kernel", i32 1} 65*9880d681SAndroid Build Coastguard Worker 66*9880d681SAndroid Build Coastguard WorkerWhen compiled, the PTX kernel functions are callable by host-side code. 67*9880d681SAndroid Build Coastguard Worker 68*9880d681SAndroid Build Coastguard Worker 69*9880d681SAndroid Build Coastguard Worker.. _address_spaces: 70*9880d681SAndroid Build Coastguard Worker 71*9880d681SAndroid Build Coastguard WorkerAddress Spaces 72*9880d681SAndroid Build Coastguard Worker-------------- 73*9880d681SAndroid Build Coastguard Worker 74*9880d681SAndroid Build Coastguard WorkerThe NVPTX back-end uses the following address space mapping: 75*9880d681SAndroid Build Coastguard Worker 76*9880d681SAndroid Build Coastguard Worker ============= ====================== 77*9880d681SAndroid Build Coastguard Worker Address Space Memory Space 78*9880d681SAndroid Build Coastguard Worker ============= ====================== 79*9880d681SAndroid Build Coastguard Worker 0 Generic 80*9880d681SAndroid Build Coastguard Worker 1 Global 81*9880d681SAndroid Build Coastguard Worker 2 Internal Use 82*9880d681SAndroid Build Coastguard Worker 3 Shared 83*9880d681SAndroid Build Coastguard Worker 4 Constant 84*9880d681SAndroid Build Coastguard Worker 5 Local 85*9880d681SAndroid Build Coastguard Worker ============= ====================== 86*9880d681SAndroid Build Coastguard Worker 87*9880d681SAndroid Build Coastguard WorkerEvery global variable and pointer type is assigned to one of these address 88*9880d681SAndroid Build Coastguard Workerspaces, with 0 being the default address space. Intrinsics are provided which 89*9880d681SAndroid Build Coastguard Workercan be used to convert pointers between the generic and non-generic address 90*9880d681SAndroid Build Coastguard Workerspaces. 91*9880d681SAndroid Build Coastguard Worker 92*9880d681SAndroid Build Coastguard WorkerAs an example, the following IR will define an array ``@g`` that resides in 93*9880d681SAndroid Build Coastguard Workerglobal device memory. 94*9880d681SAndroid Build Coastguard Worker 95*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 96*9880d681SAndroid Build Coastguard Worker 97*9880d681SAndroid Build Coastguard Worker @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] 98*9880d681SAndroid Build Coastguard Worker 99*9880d681SAndroid Build Coastguard WorkerLLVM IR functions can read and write to this array, and host-side code can 100*9880d681SAndroid Build Coastguard Workercopy data to it by name with the CUDA Driver API. 101*9880d681SAndroid Build Coastguard Worker 102*9880d681SAndroid Build Coastguard WorkerNote that since address space 0 is the generic space, it is illegal to have 103*9880d681SAndroid Build Coastguard Workerglobal variables in address space 0. Address space 0 is the default address 104*9880d681SAndroid Build Coastguard Workerspace in LLVM, so the ``addrspace(N)`` annotation is *required* for global 105*9880d681SAndroid Build Coastguard Workervariables. 106*9880d681SAndroid Build Coastguard Worker 107*9880d681SAndroid Build Coastguard Worker 108*9880d681SAndroid Build Coastguard WorkerTriples 109*9880d681SAndroid Build Coastguard Worker------- 110*9880d681SAndroid Build Coastguard Worker 111*9880d681SAndroid Build Coastguard WorkerThe NVPTX target uses the module triple to select between 32/64-bit code 112*9880d681SAndroid Build Coastguard Workergeneration and the driver-compiler interface to use. The triple architecture 113*9880d681SAndroid Build Coastguard Workercan be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The 114*9880d681SAndroid Build Coastguard Workeroperating system should be one of ``cuda`` or ``nvcl``, which determines the 115*9880d681SAndroid Build Coastguard Workerinterface used by the generated code to communicate with the driver. Most 116*9880d681SAndroid Build Coastguard Workerusers will want to use ``cuda`` as the operating system, which makes the 117*9880d681SAndroid Build Coastguard Workergenerated PTX compatible with the CUDA Driver API. 118*9880d681SAndroid Build Coastguard Worker 119*9880d681SAndroid Build Coastguard WorkerExample: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda`` 120*9880d681SAndroid Build Coastguard Worker 121*9880d681SAndroid Build Coastguard WorkerExample: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda`` 122*9880d681SAndroid Build Coastguard Worker 123*9880d681SAndroid Build Coastguard Worker 124*9880d681SAndroid Build Coastguard Worker 125*9880d681SAndroid Build Coastguard Worker.. _nvptx_intrinsics: 126*9880d681SAndroid Build Coastguard Worker 127*9880d681SAndroid Build Coastguard WorkerNVPTX Intrinsics 128*9880d681SAndroid Build Coastguard Worker================ 129*9880d681SAndroid Build Coastguard Worker 130*9880d681SAndroid Build Coastguard WorkerAddress Space Conversion 131*9880d681SAndroid Build Coastguard Worker------------------------ 132*9880d681SAndroid Build Coastguard Worker 133*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.ptr.*.to.gen``' Intrinsics 134*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 135*9880d681SAndroid Build Coastguard Worker 136*9880d681SAndroid Build Coastguard WorkerSyntax: 137*9880d681SAndroid Build Coastguard Worker""""""" 138*9880d681SAndroid Build Coastguard Worker 139*9880d681SAndroid Build Coastguard WorkerThese are overloaded intrinsics. You can use these on any pointer types. 140*9880d681SAndroid Build Coastguard Worker 141*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 142*9880d681SAndroid Build Coastguard Worker 143*9880d681SAndroid Build Coastguard Worker declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*) 144*9880d681SAndroid Build Coastguard Worker declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*) 145*9880d681SAndroid Build Coastguard Worker declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*) 146*9880d681SAndroid Build Coastguard Worker declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*) 147*9880d681SAndroid Build Coastguard Worker 148*9880d681SAndroid Build Coastguard WorkerOverview: 149*9880d681SAndroid Build Coastguard Worker""""""""" 150*9880d681SAndroid Build Coastguard Worker 151*9880d681SAndroid Build Coastguard WorkerThe '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic 152*9880d681SAndroid Build Coastguard Workeraddress space to a generic address space pointer. 153*9880d681SAndroid Build Coastguard Worker 154*9880d681SAndroid Build Coastguard WorkerSemantics: 155*9880d681SAndroid Build Coastguard Worker"""""""""" 156*9880d681SAndroid Build Coastguard Worker 157*9880d681SAndroid Build Coastguard WorkerThese intrinsics modify the pointer value to be a valid generic address space 158*9880d681SAndroid Build Coastguard Workerpointer. 159*9880d681SAndroid Build Coastguard Worker 160*9880d681SAndroid Build Coastguard Worker 161*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.ptr.gen.to.*``' Intrinsics 162*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 163*9880d681SAndroid Build Coastguard Worker 164*9880d681SAndroid Build Coastguard WorkerSyntax: 165*9880d681SAndroid Build Coastguard Worker""""""" 166*9880d681SAndroid Build Coastguard Worker 167*9880d681SAndroid Build Coastguard WorkerThese are overloaded intrinsics. You can use these on any pointer types. 168*9880d681SAndroid Build Coastguard Worker 169*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 170*9880d681SAndroid Build Coastguard Worker 171*9880d681SAndroid Build Coastguard Worker declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*) 172*9880d681SAndroid Build Coastguard Worker declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*) 173*9880d681SAndroid Build Coastguard Worker declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*) 174*9880d681SAndroid Build Coastguard Worker declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*) 175*9880d681SAndroid Build Coastguard Worker 176*9880d681SAndroid Build Coastguard WorkerOverview: 177*9880d681SAndroid Build Coastguard Worker""""""""" 178*9880d681SAndroid Build Coastguard Worker 179*9880d681SAndroid Build Coastguard WorkerThe '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic 180*9880d681SAndroid Build Coastguard Workeraddress space to a pointer in the target address space. Note that these 181*9880d681SAndroid Build Coastguard Workerintrinsics are only useful if the address space of the target address space of 182*9880d681SAndroid Build Coastguard Workerthe pointer is known. It is not legal to use address space conversion 183*9880d681SAndroid Build Coastguard Workerintrinsics to convert a pointer from one non-generic address space to another 184*9880d681SAndroid Build Coastguard Workernon-generic address space. 185*9880d681SAndroid Build Coastguard Worker 186*9880d681SAndroid Build Coastguard WorkerSemantics: 187*9880d681SAndroid Build Coastguard Worker"""""""""" 188*9880d681SAndroid Build Coastguard Worker 189*9880d681SAndroid Build Coastguard WorkerThese intrinsics modify the pointer value to be a valid pointer in the target 190*9880d681SAndroid Build Coastguard Workernon-generic address space. 191*9880d681SAndroid Build Coastguard Worker 192*9880d681SAndroid Build Coastguard Worker 193*9880d681SAndroid Build Coastguard WorkerReading PTX Special Registers 194*9880d681SAndroid Build Coastguard Worker----------------------------- 195*9880d681SAndroid Build Coastguard Worker 196*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.read.ptx.sreg.*``' 197*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 198*9880d681SAndroid Build Coastguard Worker 199*9880d681SAndroid Build Coastguard WorkerSyntax: 200*9880d681SAndroid Build Coastguard Worker""""""" 201*9880d681SAndroid Build Coastguard Worker 202*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 203*9880d681SAndroid Build Coastguard Worker 204*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() 205*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() 206*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() 207*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() 208*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() 209*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() 210*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() 211*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() 212*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() 213*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() 214*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y() 215*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z() 216*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.warpsize() 217*9880d681SAndroid Build Coastguard Worker 218*9880d681SAndroid Build Coastguard WorkerOverview: 219*9880d681SAndroid Build Coastguard Worker""""""""" 220*9880d681SAndroid Build Coastguard Worker 221*9880d681SAndroid Build Coastguard WorkerThe '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX 222*9880d681SAndroid Build Coastguard Workerspecial registers, in particular the kernel launch bounds. These registers 223*9880d681SAndroid Build Coastguard Workermap in the following way to CUDA builtins: 224*9880d681SAndroid Build Coastguard Worker 225*9880d681SAndroid Build Coastguard Worker ============ ===================================== 226*9880d681SAndroid Build Coastguard Worker CUDA Builtin PTX Special Register Intrinsic 227*9880d681SAndroid Build Coastguard Worker ============ ===================================== 228*9880d681SAndroid Build Coastguard Worker ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*`` 229*9880d681SAndroid Build Coastguard Worker ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*`` 230*9880d681SAndroid Build Coastguard Worker ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*`` 231*9880d681SAndroid Build Coastguard Worker ``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*`` 232*9880d681SAndroid Build Coastguard Worker ============ ===================================== 233*9880d681SAndroid Build Coastguard Worker 234*9880d681SAndroid Build Coastguard Worker 235*9880d681SAndroid Build Coastguard WorkerBarriers 236*9880d681SAndroid Build Coastguard Worker-------- 237*9880d681SAndroid Build Coastguard Worker 238*9880d681SAndroid Build Coastguard Worker'``llvm.nvvm.barrier0``' 239*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^^^^^^^^^ 240*9880d681SAndroid Build Coastguard Worker 241*9880d681SAndroid Build Coastguard WorkerSyntax: 242*9880d681SAndroid Build Coastguard Worker""""""" 243*9880d681SAndroid Build Coastguard Worker 244*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 245*9880d681SAndroid Build Coastguard Worker 246*9880d681SAndroid Build Coastguard Worker declare void @llvm.nvvm.barrier0() 247*9880d681SAndroid Build Coastguard Worker 248*9880d681SAndroid Build Coastguard WorkerOverview: 249*9880d681SAndroid Build Coastguard Worker""""""""" 250*9880d681SAndroid Build Coastguard Worker 251*9880d681SAndroid Build Coastguard WorkerThe '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0`` 252*9880d681SAndroid Build Coastguard Workerinstruction, equivalent to the ``__syncthreads()`` call in CUDA. 253*9880d681SAndroid Build Coastguard Worker 254*9880d681SAndroid Build Coastguard Worker 255*9880d681SAndroid Build Coastguard WorkerOther Intrinsics 256*9880d681SAndroid Build Coastguard Worker---------------- 257*9880d681SAndroid Build Coastguard Worker 258*9880d681SAndroid Build Coastguard WorkerFor the full set of NVPTX intrinsics, please see the 259*9880d681SAndroid Build Coastguard Worker``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree. 260*9880d681SAndroid Build Coastguard Worker 261*9880d681SAndroid Build Coastguard Worker 262*9880d681SAndroid Build Coastguard Worker.. _libdevice: 263*9880d681SAndroid Build Coastguard Worker 264*9880d681SAndroid Build Coastguard WorkerLinking with Libdevice 265*9880d681SAndroid Build Coastguard Worker====================== 266*9880d681SAndroid Build Coastguard Worker 267*9880d681SAndroid Build Coastguard WorkerThe CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that 268*9880d681SAndroid Build Coastguard Workerimplements many common mathematical functions. This library can be used as a 269*9880d681SAndroid Build Coastguard Workerhigh-performance math library for any compilers using the LLVM NVPTX target. 270*9880d681SAndroid Build Coastguard WorkerThe library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and 271*9880d681SAndroid Build Coastguard Workerthere is a separate version for each compute architecture. 272*9880d681SAndroid Build Coastguard Worker 273*9880d681SAndroid Build Coastguard WorkerFor a list of all math functions implemented in libdevice, see 274*9880d681SAndroid Build Coastguard Worker`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_. 275*9880d681SAndroid Build Coastguard Worker 276*9880d681SAndroid Build Coastguard WorkerTo accommodate various math-related compiler flags that can affect code 277*9880d681SAndroid Build Coastguard Workergeneration of libdevice code, the library code depends on a special LLVM IR 278*9880d681SAndroid Build Coastguard Workerpass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This 279*9880d681SAndroid Build Coastguard Workerpass looks for calls to the ``@__nvvm_reflect`` function and replaces them 280*9880d681SAndroid Build Coastguard Workerwith constants based on the defined reflection parameters. Such conditional 281*9880d681SAndroid Build Coastguard Workercode often follows a pattern: 282*9880d681SAndroid Build Coastguard Worker 283*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 284*9880d681SAndroid Build Coastguard Worker 285*9880d681SAndroid Build Coastguard Worker float my_function(float a) { 286*9880d681SAndroid Build Coastguard Worker if (__nvvm_reflect("FASTMATH")) 287*9880d681SAndroid Build Coastguard Worker return my_function_fast(a); 288*9880d681SAndroid Build Coastguard Worker else 289*9880d681SAndroid Build Coastguard Worker return my_function_precise(a); 290*9880d681SAndroid Build Coastguard Worker } 291*9880d681SAndroid Build Coastguard Worker 292*9880d681SAndroid Build Coastguard WorkerThe default value for all unspecified reflection parameters is zero. 293*9880d681SAndroid Build Coastguard Worker 294*9880d681SAndroid Build Coastguard WorkerThe ``NVVMReflect`` pass should be executed early in the optimization 295*9880d681SAndroid Build Coastguard Workerpipeline, immediately after the link stage. The ``internalize`` pass is also 296*9880d681SAndroid Build Coastguard Workerrecommended to remove unused math functions from the resulting PTX. For an 297*9880d681SAndroid Build Coastguard Workerinput IR module ``module.bc``, the following compilation flow is recommended: 298*9880d681SAndroid Build Coastguard Worker 299*9880d681SAndroid Build Coastguard Worker1. Save list of external functions in ``module.bc`` 300*9880d681SAndroid Build Coastguard Worker2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc`` 301*9880d681SAndroid Build Coastguard Worker3. Internalize all functions not in list from (1) 302*9880d681SAndroid Build Coastguard Worker4. Eliminate all unused internal functions 303*9880d681SAndroid Build Coastguard Worker5. Run ``NVVMReflect`` pass 304*9880d681SAndroid Build Coastguard Worker6. Run standard optimization pipeline 305*9880d681SAndroid Build Coastguard Worker 306*9880d681SAndroid Build Coastguard Worker.. note:: 307*9880d681SAndroid Build Coastguard Worker 308*9880d681SAndroid Build Coastguard Worker ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the 309*9880d681SAndroid Build Coastguard Worker libdevice functions. It is possible to link two IR modules that have been 310*9880d681SAndroid Build Coastguard Worker linked against libdevice using different reflection variables. 311*9880d681SAndroid Build Coastguard Worker 312*9880d681SAndroid Build Coastguard WorkerSince the ``NVVMReflect`` pass replaces conditionals with constants, it will 313*9880d681SAndroid Build Coastguard Workeroften leave behind dead code of the form: 314*9880d681SAndroid Build Coastguard Worker 315*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 316*9880d681SAndroid Build Coastguard Worker 317*9880d681SAndroid Build Coastguard Worker entry: 318*9880d681SAndroid Build Coastguard Worker .. 319*9880d681SAndroid Build Coastguard Worker br i1 true, label %foo, label %bar 320*9880d681SAndroid Build Coastguard Worker foo: 321*9880d681SAndroid Build Coastguard Worker .. 322*9880d681SAndroid Build Coastguard Worker bar: 323*9880d681SAndroid Build Coastguard Worker ; Dead code 324*9880d681SAndroid Build Coastguard Worker .. 325*9880d681SAndroid Build Coastguard Worker 326*9880d681SAndroid Build Coastguard WorkerTherefore, it is recommended that ``NVVMReflect`` is executed early in the 327*9880d681SAndroid Build Coastguard Workeroptimization pipeline before dead-code elimination. 328*9880d681SAndroid Build Coastguard Worker 329*9880d681SAndroid Build Coastguard Worker 330*9880d681SAndroid Build Coastguard WorkerReflection Parameters 331*9880d681SAndroid Build Coastguard Worker--------------------- 332*9880d681SAndroid Build Coastguard Worker 333*9880d681SAndroid Build Coastguard WorkerThe libdevice library currently uses the following reflection parameters to 334*9880d681SAndroid Build Coastguard Workercontrol code generation: 335*9880d681SAndroid Build Coastguard Worker 336*9880d681SAndroid Build Coastguard Worker==================== ====================================================== 337*9880d681SAndroid Build Coastguard WorkerFlag Description 338*9880d681SAndroid Build Coastguard Worker==================== ====================================================== 339*9880d681SAndroid Build Coastguard Worker``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero 340*9880d681SAndroid Build Coastguard Worker==================== ====================================================== 341*9880d681SAndroid Build Coastguard Worker 342*9880d681SAndroid Build Coastguard Worker 343*9880d681SAndroid Build Coastguard WorkerInvoking NVVMReflect 344*9880d681SAndroid Build Coastguard Worker-------------------- 345*9880d681SAndroid Build Coastguard Worker 346*9880d681SAndroid Build Coastguard WorkerTo ensure that all dead code caused by the reflection pass is eliminated, it 347*9880d681SAndroid Build Coastguard Workeris recommended that the reflection pass is executed early in the LLVM IR 348*9880d681SAndroid Build Coastguard Workeroptimization pipeline. The pass takes an optional mapping of reflection 349*9880d681SAndroid Build Coastguard Workerparameter name to an integer value. This mapping can be specified as either a 350*9880d681SAndroid Build Coastguard Workercommand-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when 351*9880d681SAndroid Build Coastguard Workerprogrammatically creating a pass pipeline. 352*9880d681SAndroid Build Coastguard Worker 353*9880d681SAndroid Build Coastguard WorkerWith ``opt``: 354*9880d681SAndroid Build Coastguard Worker 355*9880d681SAndroid Build Coastguard Worker.. code-block:: text 356*9880d681SAndroid Build Coastguard Worker 357*9880d681SAndroid Build Coastguard Worker # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc 358*9880d681SAndroid Build Coastguard Worker 359*9880d681SAndroid Build Coastguard Worker 360*9880d681SAndroid Build Coastguard WorkerWith programmatic pass pipeline: 361*9880d681SAndroid Build Coastguard Worker 362*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 363*9880d681SAndroid Build Coastguard Worker 364*9880d681SAndroid Build Coastguard Worker extern FunctionPass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping); 365*9880d681SAndroid Build Coastguard Worker 366*9880d681SAndroid Build Coastguard Worker StringMap<int> ReflectParams; 367*9880d681SAndroid Build Coastguard Worker ReflectParams["__CUDA_FTZ"] = 1; 368*9880d681SAndroid Build Coastguard Worker Passes.add(createNVVMReflectPass(ReflectParams)); 369*9880d681SAndroid Build Coastguard Worker 370*9880d681SAndroid Build Coastguard Worker 371*9880d681SAndroid Build Coastguard Worker 372*9880d681SAndroid Build Coastguard WorkerExecuting PTX 373*9880d681SAndroid Build Coastguard Worker============= 374*9880d681SAndroid Build Coastguard Worker 375*9880d681SAndroid Build Coastguard WorkerThe most common way to execute PTX assembly on a GPU device is to use the CUDA 376*9880d681SAndroid Build Coastguard WorkerDriver API. This API is a low-level interface to the GPU driver and allows for 377*9880d681SAndroid Build Coastguard WorkerJIT compilation of PTX code to native GPU machine code. 378*9880d681SAndroid Build Coastguard Worker 379*9880d681SAndroid Build Coastguard WorkerInitializing the Driver API: 380*9880d681SAndroid Build Coastguard Worker 381*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 382*9880d681SAndroid Build Coastguard Worker 383*9880d681SAndroid Build Coastguard Worker CUdevice device; 384*9880d681SAndroid Build Coastguard Worker CUcontext context; 385*9880d681SAndroid Build Coastguard Worker 386*9880d681SAndroid Build Coastguard Worker // Initialize the driver API 387*9880d681SAndroid Build Coastguard Worker cuInit(0); 388*9880d681SAndroid Build Coastguard Worker // Get a handle to the first compute device 389*9880d681SAndroid Build Coastguard Worker cuDeviceGet(&device, 0); 390*9880d681SAndroid Build Coastguard Worker // Create a compute device context 391*9880d681SAndroid Build Coastguard Worker cuCtxCreate(&context, 0, device); 392*9880d681SAndroid Build Coastguard Worker 393*9880d681SAndroid Build Coastguard WorkerJIT compiling a PTX string to a device binary: 394*9880d681SAndroid Build Coastguard Worker 395*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 396*9880d681SAndroid Build Coastguard Worker 397*9880d681SAndroid Build Coastguard Worker CUmodule module; 398*9880d681SAndroid Build Coastguard Worker CUfunction function; 399*9880d681SAndroid Build Coastguard Worker 400*9880d681SAndroid Build Coastguard Worker // JIT compile a null-terminated PTX string 401*9880d681SAndroid Build Coastguard Worker cuModuleLoadData(&module, (void*)PTXString); 402*9880d681SAndroid Build Coastguard Worker 403*9880d681SAndroid Build Coastguard Worker // Get a handle to the "myfunction" kernel function 404*9880d681SAndroid Build Coastguard Worker cuModuleGetFunction(&function, module, "myfunction"); 405*9880d681SAndroid Build Coastguard Worker 406*9880d681SAndroid Build Coastguard WorkerFor full examples of executing PTX assembly, please see the `CUDA Samples 407*9880d681SAndroid Build Coastguard Worker<https://developer.nvidia.com/cuda-downloads>`_ distribution. 408*9880d681SAndroid Build Coastguard Worker 409*9880d681SAndroid Build Coastguard Worker 410*9880d681SAndroid Build Coastguard WorkerCommon Issues 411*9880d681SAndroid Build Coastguard Worker============= 412*9880d681SAndroid Build Coastguard Worker 413*9880d681SAndroid Build Coastguard Workerptxas complains of undefined function: __nvvm_reflect 414*9880d681SAndroid Build Coastguard Worker----------------------------------------------------- 415*9880d681SAndroid Build Coastguard Worker 416*9880d681SAndroid Build Coastguard WorkerWhen linking with libdevice, the ``NVVMReflect`` pass must be used. See 417*9880d681SAndroid Build Coastguard Worker:ref:`libdevice` for more information. 418*9880d681SAndroid Build Coastguard Worker 419*9880d681SAndroid Build Coastguard Worker 420*9880d681SAndroid Build Coastguard WorkerTutorial: A Simple Compute Kernel 421*9880d681SAndroid Build Coastguard Worker================================= 422*9880d681SAndroid Build Coastguard Worker 423*9880d681SAndroid Build Coastguard WorkerTo start, let us take a look at a simple compute kernel written directly in 424*9880d681SAndroid Build Coastguard WorkerLLVM IR. The kernel implements vector addition, where each thread computes one 425*9880d681SAndroid Build Coastguard Workerelement of the output vector C from the input vectors A and B. To make this 426*9880d681SAndroid Build Coastguard Workereasier, we also assume that only a single CTA (thread block) will be launched, 427*9880d681SAndroid Build Coastguard Workerand that it will be one dimensional. 428*9880d681SAndroid Build Coastguard Worker 429*9880d681SAndroid Build Coastguard Worker 430*9880d681SAndroid Build Coastguard WorkerThe Kernel 431*9880d681SAndroid Build Coastguard Worker---------- 432*9880d681SAndroid Build Coastguard Worker 433*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 434*9880d681SAndroid Build Coastguard Worker 435*9880d681SAndroid Build Coastguard Worker target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 436*9880d681SAndroid Build Coastguard Worker target triple = "nvptx64-nvidia-cuda" 437*9880d681SAndroid Build Coastguard Worker 438*9880d681SAndroid Build Coastguard Worker ; Intrinsic to read X component of thread ID 439*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 440*9880d681SAndroid Build Coastguard Worker 441*9880d681SAndroid Build Coastguard Worker define void @kernel(float addrspace(1)* %A, 442*9880d681SAndroid Build Coastguard Worker float addrspace(1)* %B, 443*9880d681SAndroid Build Coastguard Worker float addrspace(1)* %C) { 444*9880d681SAndroid Build Coastguard Worker entry: 445*9880d681SAndroid Build Coastguard Worker ; What is my ID? 446*9880d681SAndroid Build Coastguard Worker %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 447*9880d681SAndroid Build Coastguard Worker 448*9880d681SAndroid Build Coastguard Worker ; Compute pointers into A, B, and C 449*9880d681SAndroid Build Coastguard Worker %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id 450*9880d681SAndroid Build Coastguard Worker %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id 451*9880d681SAndroid Build Coastguard Worker %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id 452*9880d681SAndroid Build Coastguard Worker 453*9880d681SAndroid Build Coastguard Worker ; Read A, B 454*9880d681SAndroid Build Coastguard Worker %valA = load float, float addrspace(1)* %ptrA, align 4 455*9880d681SAndroid Build Coastguard Worker %valB = load float, float addrspace(1)* %ptrB, align 4 456*9880d681SAndroid Build Coastguard Worker 457*9880d681SAndroid Build Coastguard Worker ; Compute C = A + B 458*9880d681SAndroid Build Coastguard Worker %valC = fadd float %valA, %valB 459*9880d681SAndroid Build Coastguard Worker 460*9880d681SAndroid Build Coastguard Worker ; Store back to C 461*9880d681SAndroid Build Coastguard Worker store float %valC, float addrspace(1)* %ptrC, align 4 462*9880d681SAndroid Build Coastguard Worker 463*9880d681SAndroid Build Coastguard Worker ret void 464*9880d681SAndroid Build Coastguard Worker } 465*9880d681SAndroid Build Coastguard Worker 466*9880d681SAndroid Build Coastguard Worker !nvvm.annotations = !{!0} 467*9880d681SAndroid Build Coastguard Worker !0 = !{void (float addrspace(1)*, 468*9880d681SAndroid Build Coastguard Worker float addrspace(1)*, 469*9880d681SAndroid Build Coastguard Worker float addrspace(1)*)* @kernel, !"kernel", i32 1} 470*9880d681SAndroid Build Coastguard Worker 471*9880d681SAndroid Build Coastguard Worker 472*9880d681SAndroid Build Coastguard WorkerWe can use the LLVM ``llc`` tool to directly run the NVPTX code generator: 473*9880d681SAndroid Build Coastguard Worker 474*9880d681SAndroid Build Coastguard Worker.. code-block:: text 475*9880d681SAndroid Build Coastguard Worker 476*9880d681SAndroid Build Coastguard Worker # llc -mcpu=sm_20 kernel.ll -o kernel.ptx 477*9880d681SAndroid Build Coastguard Worker 478*9880d681SAndroid Build Coastguard Worker 479*9880d681SAndroid Build Coastguard Worker.. note:: 480*9880d681SAndroid Build Coastguard Worker 481*9880d681SAndroid Build Coastguard Worker If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32`` 482*9880d681SAndroid Build Coastguard Worker in the module data layout string and use ``nvptx-nvidia-cuda`` as the 483*9880d681SAndroid Build Coastguard Worker target triple. 484*9880d681SAndroid Build Coastguard Worker 485*9880d681SAndroid Build Coastguard Worker 486*9880d681SAndroid Build Coastguard WorkerThe output we get from ``llc`` (as of LLVM 3.4): 487*9880d681SAndroid Build Coastguard Worker 488*9880d681SAndroid Build Coastguard Worker.. code-block:: text 489*9880d681SAndroid Build Coastguard Worker 490*9880d681SAndroid Build Coastguard Worker // 491*9880d681SAndroid Build Coastguard Worker // Generated by LLVM NVPTX Back-End 492*9880d681SAndroid Build Coastguard Worker // 493*9880d681SAndroid Build Coastguard Worker 494*9880d681SAndroid Build Coastguard Worker .version 3.1 495*9880d681SAndroid Build Coastguard Worker .target sm_20 496*9880d681SAndroid Build Coastguard Worker .address_size 64 497*9880d681SAndroid Build Coastguard Worker 498*9880d681SAndroid Build Coastguard Worker // .globl kernel 499*9880d681SAndroid Build Coastguard Worker // @kernel 500*9880d681SAndroid Build Coastguard Worker .visible .entry kernel( 501*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_0, 502*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_1, 503*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_2 504*9880d681SAndroid Build Coastguard Worker ) 505*9880d681SAndroid Build Coastguard Worker { 506*9880d681SAndroid Build Coastguard Worker .reg .f32 %f<4>; 507*9880d681SAndroid Build Coastguard Worker .reg .s32 %r<2>; 508*9880d681SAndroid Build Coastguard Worker .reg .s64 %rl<8>; 509*9880d681SAndroid Build Coastguard Worker 510*9880d681SAndroid Build Coastguard Worker // BB#0: // %entry 511*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl1, [kernel_param_0]; 512*9880d681SAndroid Build Coastguard Worker mov.u32 %r1, %tid.x; 513*9880d681SAndroid Build Coastguard Worker mul.wide.s32 %rl2, %r1, 4; 514*9880d681SAndroid Build Coastguard Worker add.s64 %rl3, %rl1, %rl2; 515*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl4, [kernel_param_1]; 516*9880d681SAndroid Build Coastguard Worker add.s64 %rl5, %rl4, %rl2; 517*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl6, [kernel_param_2]; 518*9880d681SAndroid Build Coastguard Worker add.s64 %rl7, %rl6, %rl2; 519*9880d681SAndroid Build Coastguard Worker ld.global.f32 %f1, [%rl3]; 520*9880d681SAndroid Build Coastguard Worker ld.global.f32 %f2, [%rl5]; 521*9880d681SAndroid Build Coastguard Worker add.f32 %f3, %f1, %f2; 522*9880d681SAndroid Build Coastguard Worker st.global.f32 [%rl7], %f3; 523*9880d681SAndroid Build Coastguard Worker ret; 524*9880d681SAndroid Build Coastguard Worker } 525*9880d681SAndroid Build Coastguard Worker 526*9880d681SAndroid Build Coastguard Worker 527*9880d681SAndroid Build Coastguard WorkerDissecting the Kernel 528*9880d681SAndroid Build Coastguard Worker--------------------- 529*9880d681SAndroid Build Coastguard Worker 530*9880d681SAndroid Build Coastguard WorkerNow let us dissect the LLVM IR that makes up this kernel. 531*9880d681SAndroid Build Coastguard Worker 532*9880d681SAndroid Build Coastguard WorkerData Layout 533*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^ 534*9880d681SAndroid Build Coastguard Worker 535*9880d681SAndroid Build Coastguard WorkerThe data layout string determines the size in bits of common data types, their 536*9880d681SAndroid Build Coastguard WorkerABI alignment, and their storage size. For NVPTX, you should use one of the 537*9880d681SAndroid Build Coastguard Workerfollowing: 538*9880d681SAndroid Build Coastguard Worker 539*9880d681SAndroid Build Coastguard Worker32-bit PTX: 540*9880d681SAndroid Build Coastguard Worker 541*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 542*9880d681SAndroid Build Coastguard Worker 543*9880d681SAndroid Build Coastguard Worker target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 544*9880d681SAndroid Build Coastguard Worker 545*9880d681SAndroid Build Coastguard Worker64-bit PTX: 546*9880d681SAndroid Build Coastguard Worker 547*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 548*9880d681SAndroid Build Coastguard Worker 549*9880d681SAndroid Build Coastguard Worker target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 550*9880d681SAndroid Build Coastguard Worker 551*9880d681SAndroid Build Coastguard Worker 552*9880d681SAndroid Build Coastguard WorkerTarget Intrinsics 553*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^ 554*9880d681SAndroid Build Coastguard Worker 555*9880d681SAndroid Build Coastguard WorkerIn this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to 556*9880d681SAndroid Build Coastguard Workerread the X component of the current thread's ID, which corresponds to a read 557*9880d681SAndroid Build Coastguard Workerof register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of 558*9880d681SAndroid Build Coastguard Workerintrinsics. A short list is shown below; please see 559*9880d681SAndroid Build Coastguard Worker``include/llvm/IR/IntrinsicsNVVM.td`` for the full list. 560*9880d681SAndroid Build Coastguard Worker 561*9880d681SAndroid Build Coastguard Worker 562*9880d681SAndroid Build Coastguard Worker================================================ ==================== 563*9880d681SAndroid Build Coastguard WorkerIntrinsic CUDA Equivalent 564*9880d681SAndroid Build Coastguard Worker================================================ ==================== 565*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z} 566*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z} 567*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z} 568*9880d681SAndroid Build Coastguard Worker``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z} 569*9880d681SAndroid Build Coastguard Worker``void @llvm.nvvm.barrier0()`` __syncthreads() 570*9880d681SAndroid Build Coastguard Worker================================================ ==================== 571*9880d681SAndroid Build Coastguard Worker 572*9880d681SAndroid Build Coastguard Worker 573*9880d681SAndroid Build Coastguard WorkerAddress Spaces 574*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^ 575*9880d681SAndroid Build Coastguard Worker 576*9880d681SAndroid Build Coastguard WorkerYou may have noticed that all of the pointer types in the LLVM IR example had 577*9880d681SAndroid Build Coastguard Workeran explicit address space specifier. What is address space 1? NVIDIA GPU 578*9880d681SAndroid Build Coastguard Workerdevices (generally) have four types of memory: 579*9880d681SAndroid Build Coastguard Worker 580*9880d681SAndroid Build Coastguard Worker- Global: Large, off-chip memory 581*9880d681SAndroid Build Coastguard Worker- Shared: Small, on-chip memory shared among all threads in a CTA 582*9880d681SAndroid Build Coastguard Worker- Local: Per-thread, private memory 583*9880d681SAndroid Build Coastguard Worker- Constant: Read-only memory shared across all threads 584*9880d681SAndroid Build Coastguard Worker 585*9880d681SAndroid Build Coastguard WorkerThese different types of memory are represented in LLVM IR as address spaces. 586*9880d681SAndroid Build Coastguard WorkerThere is also a fifth address space used by the NVPTX code generator that 587*9880d681SAndroid Build Coastguard Workercorresponds to the "generic" address space. This address space can represent 588*9880d681SAndroid Build Coastguard Workeraddresses in any other address space (with a few exceptions). This allows 589*9880d681SAndroid Build Coastguard Workerusers to write IR functions that can load/store memory using the same 590*9880d681SAndroid Build Coastguard Workerinstructions. Intrinsics are provided to convert pointers between the generic 591*9880d681SAndroid Build Coastguard Workerand non-generic address spaces. 592*9880d681SAndroid Build Coastguard Worker 593*9880d681SAndroid Build Coastguard WorkerSee :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information. 594*9880d681SAndroid Build Coastguard Worker 595*9880d681SAndroid Build Coastguard Worker 596*9880d681SAndroid Build Coastguard WorkerKernel Metadata 597*9880d681SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^ 598*9880d681SAndroid Build Coastguard Worker 599*9880d681SAndroid Build Coastguard WorkerIn PTX, a function can be either a `kernel` function (callable from the host 600*9880d681SAndroid Build Coastguard Workerprogram), or a `device` function (callable only from GPU code). You can think 601*9880d681SAndroid Build Coastguard Workerof `kernel` functions as entry-points in the GPU program. To mark an LLVM IR 602*9880d681SAndroid Build Coastguard Workerfunction as a `kernel` function, we make use of special LLVM metadata. The 603*9880d681SAndroid Build Coastguard WorkerNVPTX back-end will look for a named metadata node called 604*9880d681SAndroid Build Coastguard Worker``nvvm.annotations``. This named metadata must contain a list of metadata that 605*9880d681SAndroid Build Coastguard Workerdescribe the IR. For our purposes, we need to declare a metadata node that 606*9880d681SAndroid Build Coastguard Workerassigns the "kernel" attribute to the LLVM IR function that should be emitted 607*9880d681SAndroid Build Coastguard Workeras a PTX `kernel` function. These metadata nodes take the form: 608*9880d681SAndroid Build Coastguard Worker 609*9880d681SAndroid Build Coastguard Worker.. code-block:: text 610*9880d681SAndroid Build Coastguard Worker 611*9880d681SAndroid Build Coastguard Worker !{<function ref>, metadata !"kernel", i32 1} 612*9880d681SAndroid Build Coastguard Worker 613*9880d681SAndroid Build Coastguard WorkerFor the previous example, we have: 614*9880d681SAndroid Build Coastguard Worker 615*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 616*9880d681SAndroid Build Coastguard Worker 617*9880d681SAndroid Build Coastguard Worker !nvvm.annotations = !{!0} 618*9880d681SAndroid Build Coastguard Worker !0 = !{void (float addrspace(1)*, 619*9880d681SAndroid Build Coastguard Worker float addrspace(1)*, 620*9880d681SAndroid Build Coastguard Worker float addrspace(1)*)* @kernel, !"kernel", i32 1} 621*9880d681SAndroid Build Coastguard Worker 622*9880d681SAndroid Build Coastguard WorkerHere, we have a single metadata declaration in ``nvvm.annotations``. This 623*9880d681SAndroid Build Coastguard Workermetadata annotates our ``@kernel`` function with the ``kernel`` attribute. 624*9880d681SAndroid Build Coastguard Worker 625*9880d681SAndroid Build Coastguard Worker 626*9880d681SAndroid Build Coastguard WorkerRunning the Kernel 627*9880d681SAndroid Build Coastguard Worker------------------ 628*9880d681SAndroid Build Coastguard Worker 629*9880d681SAndroid Build Coastguard WorkerGenerating PTX from LLVM IR is all well and good, but how do we execute it on 630*9880d681SAndroid Build Coastguard Workera real GPU device? The CUDA Driver API provides a convenient mechanism for 631*9880d681SAndroid Build Coastguard Workerloading and JIT compiling PTX to a native GPU device, and launching a kernel. 632*9880d681SAndroid Build Coastguard WorkerThe API is similar to OpenCL. A simple example showing how to load and 633*9880d681SAndroid Build Coastguard Workerexecute our vector addition code is shown below. Note that for brevity this 634*9880d681SAndroid Build Coastguard Workercode does not perform much error checking! 635*9880d681SAndroid Build Coastguard Worker 636*9880d681SAndroid Build Coastguard Worker.. note:: 637*9880d681SAndroid Build Coastguard Worker 638*9880d681SAndroid Build Coastguard Worker You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline 639*9880d681SAndroid Build Coastguard Worker compile PTX to machine code (SASS) for a specific GPU architecture. Such 640*9880d681SAndroid Build Coastguard Worker binaries can be loaded by the CUDA Driver API in the same way as PTX. This 641*9880d681SAndroid Build Coastguard Worker can be useful for reducing startup time by precompiling the PTX kernels. 642*9880d681SAndroid Build Coastguard Worker 643*9880d681SAndroid Build Coastguard Worker 644*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 645*9880d681SAndroid Build Coastguard Worker 646*9880d681SAndroid Build Coastguard Worker #include <iostream> 647*9880d681SAndroid Build Coastguard Worker #include <fstream> 648*9880d681SAndroid Build Coastguard Worker #include <cassert> 649*9880d681SAndroid Build Coastguard Worker #include "cuda.h" 650*9880d681SAndroid Build Coastguard Worker 651*9880d681SAndroid Build Coastguard Worker 652*9880d681SAndroid Build Coastguard Worker void checkCudaErrors(CUresult err) { 653*9880d681SAndroid Build Coastguard Worker assert(err == CUDA_SUCCESS); 654*9880d681SAndroid Build Coastguard Worker } 655*9880d681SAndroid Build Coastguard Worker 656*9880d681SAndroid Build Coastguard Worker /// main - Program entry point 657*9880d681SAndroid Build Coastguard Worker int main(int argc, char **argv) { 658*9880d681SAndroid Build Coastguard Worker CUdevice device; 659*9880d681SAndroid Build Coastguard Worker CUmodule cudaModule; 660*9880d681SAndroid Build Coastguard Worker CUcontext context; 661*9880d681SAndroid Build Coastguard Worker CUfunction function; 662*9880d681SAndroid Build Coastguard Worker CUlinkState linker; 663*9880d681SAndroid Build Coastguard Worker int devCount; 664*9880d681SAndroid Build Coastguard Worker 665*9880d681SAndroid Build Coastguard Worker // CUDA initialization 666*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuInit(0)); 667*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuDeviceGetCount(&devCount)); 668*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuDeviceGet(&device, 0)); 669*9880d681SAndroid Build Coastguard Worker 670*9880d681SAndroid Build Coastguard Worker char name[128]; 671*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuDeviceGetName(name, 128, device)); 672*9880d681SAndroid Build Coastguard Worker std::cout << "Using CUDA Device [0]: " << name << "\n"; 673*9880d681SAndroid Build Coastguard Worker 674*9880d681SAndroid Build Coastguard Worker int devMajor, devMinor; 675*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device)); 676*9880d681SAndroid Build Coastguard Worker std::cout << "Device Compute Capability: " 677*9880d681SAndroid Build Coastguard Worker << devMajor << "." << devMinor << "\n"; 678*9880d681SAndroid Build Coastguard Worker if (devMajor < 2) { 679*9880d681SAndroid Build Coastguard Worker std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n"; 680*9880d681SAndroid Build Coastguard Worker return 1; 681*9880d681SAndroid Build Coastguard Worker } 682*9880d681SAndroid Build Coastguard Worker 683*9880d681SAndroid Build Coastguard Worker std::ifstream t("kernel.ptx"); 684*9880d681SAndroid Build Coastguard Worker if (!t.is_open()) { 685*9880d681SAndroid Build Coastguard Worker std::cerr << "kernel.ptx not found\n"; 686*9880d681SAndroid Build Coastguard Worker return 1; 687*9880d681SAndroid Build Coastguard Worker } 688*9880d681SAndroid Build Coastguard Worker std::string str((std::istreambuf_iterator<char>(t)), 689*9880d681SAndroid Build Coastguard Worker std::istreambuf_iterator<char>()); 690*9880d681SAndroid Build Coastguard Worker 691*9880d681SAndroid Build Coastguard Worker // Create driver context 692*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuCtxCreate(&context, 0, device)); 693*9880d681SAndroid Build Coastguard Worker 694*9880d681SAndroid Build Coastguard Worker // Create module for object 695*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0)); 696*9880d681SAndroid Build Coastguard Worker 697*9880d681SAndroid Build Coastguard Worker // Get kernel function 698*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel")); 699*9880d681SAndroid Build Coastguard Worker 700*9880d681SAndroid Build Coastguard Worker // Device data 701*9880d681SAndroid Build Coastguard Worker CUdeviceptr devBufferA; 702*9880d681SAndroid Build Coastguard Worker CUdeviceptr devBufferB; 703*9880d681SAndroid Build Coastguard Worker CUdeviceptr devBufferC; 704*9880d681SAndroid Build Coastguard Worker 705*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16)); 706*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16)); 707*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16)); 708*9880d681SAndroid Build Coastguard Worker 709*9880d681SAndroid Build Coastguard Worker float* hostA = new float[16]; 710*9880d681SAndroid Build Coastguard Worker float* hostB = new float[16]; 711*9880d681SAndroid Build Coastguard Worker float* hostC = new float[16]; 712*9880d681SAndroid Build Coastguard Worker 713*9880d681SAndroid Build Coastguard Worker // Populate input 714*9880d681SAndroid Build Coastguard Worker for (unsigned i = 0; i != 16; ++i) { 715*9880d681SAndroid Build Coastguard Worker hostA[i] = (float)i; 716*9880d681SAndroid Build Coastguard Worker hostB[i] = (float)(2*i); 717*9880d681SAndroid Build Coastguard Worker hostC[i] = 0.0f; 718*9880d681SAndroid Build Coastguard Worker } 719*9880d681SAndroid Build Coastguard Worker 720*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16)); 721*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16)); 722*9880d681SAndroid Build Coastguard Worker 723*9880d681SAndroid Build Coastguard Worker 724*9880d681SAndroid Build Coastguard Worker unsigned blockSizeX = 16; 725*9880d681SAndroid Build Coastguard Worker unsigned blockSizeY = 1; 726*9880d681SAndroid Build Coastguard Worker unsigned blockSizeZ = 1; 727*9880d681SAndroid Build Coastguard Worker unsigned gridSizeX = 1; 728*9880d681SAndroid Build Coastguard Worker unsigned gridSizeY = 1; 729*9880d681SAndroid Build Coastguard Worker unsigned gridSizeZ = 1; 730*9880d681SAndroid Build Coastguard Worker 731*9880d681SAndroid Build Coastguard Worker // Kernel parameters 732*9880d681SAndroid Build Coastguard Worker void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC }; 733*9880d681SAndroid Build Coastguard Worker 734*9880d681SAndroid Build Coastguard Worker std::cout << "Launching kernel\n"; 735*9880d681SAndroid Build Coastguard Worker 736*9880d681SAndroid Build Coastguard Worker // Kernel launch 737*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ, 738*9880d681SAndroid Build Coastguard Worker blockSizeX, blockSizeY, blockSizeZ, 739*9880d681SAndroid Build Coastguard Worker 0, NULL, KernelParams, NULL)); 740*9880d681SAndroid Build Coastguard Worker 741*9880d681SAndroid Build Coastguard Worker // Retrieve device data 742*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16)); 743*9880d681SAndroid Build Coastguard Worker 744*9880d681SAndroid Build Coastguard Worker 745*9880d681SAndroid Build Coastguard Worker std::cout << "Results:\n"; 746*9880d681SAndroid Build Coastguard Worker for (unsigned i = 0; i != 16; ++i) { 747*9880d681SAndroid Build Coastguard Worker std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n"; 748*9880d681SAndroid Build Coastguard Worker } 749*9880d681SAndroid Build Coastguard Worker 750*9880d681SAndroid Build Coastguard Worker 751*9880d681SAndroid Build Coastguard Worker // Clean up after ourselves 752*9880d681SAndroid Build Coastguard Worker delete [] hostA; 753*9880d681SAndroid Build Coastguard Worker delete [] hostB; 754*9880d681SAndroid Build Coastguard Worker delete [] hostC; 755*9880d681SAndroid Build Coastguard Worker 756*9880d681SAndroid Build Coastguard Worker // Clean-up 757*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemFree(devBufferA)); 758*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemFree(devBufferB)); 759*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuMemFree(devBufferC)); 760*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuModuleUnload(cudaModule)); 761*9880d681SAndroid Build Coastguard Worker checkCudaErrors(cuCtxDestroy(context)); 762*9880d681SAndroid Build Coastguard Worker 763*9880d681SAndroid Build Coastguard Worker return 0; 764*9880d681SAndroid Build Coastguard Worker } 765*9880d681SAndroid Build Coastguard Worker 766*9880d681SAndroid Build Coastguard Worker 767*9880d681SAndroid Build Coastguard WorkerYou will need to link with the CUDA driver and specify the path to cuda.h. 768*9880d681SAndroid Build Coastguard Worker 769*9880d681SAndroid Build Coastguard Worker.. code-block:: text 770*9880d681SAndroid Build Coastguard Worker 771*9880d681SAndroid Build Coastguard Worker # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda 772*9880d681SAndroid Build Coastguard Worker 773*9880d681SAndroid Build Coastguard WorkerWe don't need to specify a path to ``libcuda.so`` since this is installed in a 774*9880d681SAndroid Build Coastguard Workersystem location by the driver, not the CUDA toolkit. 775*9880d681SAndroid Build Coastguard Worker 776*9880d681SAndroid Build Coastguard WorkerIf everything goes as planned, you should see the following output when 777*9880d681SAndroid Build Coastguard Workerrunning the compiled program: 778*9880d681SAndroid Build Coastguard Worker 779*9880d681SAndroid Build Coastguard Worker.. code-block:: text 780*9880d681SAndroid Build Coastguard Worker 781*9880d681SAndroid Build Coastguard Worker Using CUDA Device [0]: GeForce GTX 680 782*9880d681SAndroid Build Coastguard Worker Device Compute Capability: 3.0 783*9880d681SAndroid Build Coastguard Worker Launching kernel 784*9880d681SAndroid Build Coastguard Worker Results: 785*9880d681SAndroid Build Coastguard Worker 0 + 0 = 0 786*9880d681SAndroid Build Coastguard Worker 1 + 2 = 3 787*9880d681SAndroid Build Coastguard Worker 2 + 4 = 6 788*9880d681SAndroid Build Coastguard Worker 3 + 6 = 9 789*9880d681SAndroid Build Coastguard Worker 4 + 8 = 12 790*9880d681SAndroid Build Coastguard Worker 5 + 10 = 15 791*9880d681SAndroid Build Coastguard Worker 6 + 12 = 18 792*9880d681SAndroid Build Coastguard Worker 7 + 14 = 21 793*9880d681SAndroid Build Coastguard Worker 8 + 16 = 24 794*9880d681SAndroid Build Coastguard Worker 9 + 18 = 27 795*9880d681SAndroid Build Coastguard Worker 10 + 20 = 30 796*9880d681SAndroid Build Coastguard Worker 11 + 22 = 33 797*9880d681SAndroid Build Coastguard Worker 12 + 24 = 36 798*9880d681SAndroid Build Coastguard Worker 13 + 26 = 39 799*9880d681SAndroid Build Coastguard Worker 14 + 28 = 42 800*9880d681SAndroid Build Coastguard Worker 15 + 30 = 45 801*9880d681SAndroid Build Coastguard Worker 802*9880d681SAndroid Build Coastguard Worker.. note:: 803*9880d681SAndroid Build Coastguard Worker 804*9880d681SAndroid Build Coastguard Worker You will likely see a different device identifier based on your hardware 805*9880d681SAndroid Build Coastguard Worker 806*9880d681SAndroid Build Coastguard Worker 807*9880d681SAndroid Build Coastguard WorkerTutorial: Linking with Libdevice 808*9880d681SAndroid Build Coastguard Worker================================ 809*9880d681SAndroid Build Coastguard Worker 810*9880d681SAndroid Build Coastguard WorkerIn this tutorial, we show a simple example of linking LLVM IR with the 811*9880d681SAndroid Build Coastguard Workerlibdevice library. We will use the same kernel as the previous tutorial, 812*9880d681SAndroid Build Coastguard Workerexcept that we will compute ``C = pow(A, B)`` instead of ``C = A + B``. 813*9880d681SAndroid Build Coastguard WorkerLibdevice provides an ``__nv_powf`` function that we will use. 814*9880d681SAndroid Build Coastguard Worker 815*9880d681SAndroid Build Coastguard Worker.. code-block:: llvm 816*9880d681SAndroid Build Coastguard Worker 817*9880d681SAndroid Build Coastguard Worker target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 818*9880d681SAndroid Build Coastguard Worker target triple = "nvptx64-nvidia-cuda" 819*9880d681SAndroid Build Coastguard Worker 820*9880d681SAndroid Build Coastguard Worker ; Intrinsic to read X component of thread ID 821*9880d681SAndroid Build Coastguard Worker declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 822*9880d681SAndroid Build Coastguard Worker ; libdevice function 823*9880d681SAndroid Build Coastguard Worker declare float @__nv_powf(float, float) 824*9880d681SAndroid Build Coastguard Worker 825*9880d681SAndroid Build Coastguard Worker define void @kernel(float addrspace(1)* %A, 826*9880d681SAndroid Build Coastguard Worker float addrspace(1)* %B, 827*9880d681SAndroid Build Coastguard Worker float addrspace(1)* %C) { 828*9880d681SAndroid Build Coastguard Worker entry: 829*9880d681SAndroid Build Coastguard Worker ; What is my ID? 830*9880d681SAndroid Build Coastguard Worker %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 831*9880d681SAndroid Build Coastguard Worker 832*9880d681SAndroid Build Coastguard Worker ; Compute pointers into A, B, and C 833*9880d681SAndroid Build Coastguard Worker %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id 834*9880d681SAndroid Build Coastguard Worker %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id 835*9880d681SAndroid Build Coastguard Worker %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id 836*9880d681SAndroid Build Coastguard Worker 837*9880d681SAndroid Build Coastguard Worker ; Read A, B 838*9880d681SAndroid Build Coastguard Worker %valA = load float, float addrspace(1)* %ptrA, align 4 839*9880d681SAndroid Build Coastguard Worker %valB = load float, float addrspace(1)* %ptrB, align 4 840*9880d681SAndroid Build Coastguard Worker 841*9880d681SAndroid Build Coastguard Worker ; Compute C = pow(A, B) 842*9880d681SAndroid Build Coastguard Worker %valC = call float @__nv_powf(float %valA, float %valB) 843*9880d681SAndroid Build Coastguard Worker 844*9880d681SAndroid Build Coastguard Worker ; Store back to C 845*9880d681SAndroid Build Coastguard Worker store float %valC, float addrspace(1)* %ptrC, align 4 846*9880d681SAndroid Build Coastguard Worker 847*9880d681SAndroid Build Coastguard Worker ret void 848*9880d681SAndroid Build Coastguard Worker } 849*9880d681SAndroid Build Coastguard Worker 850*9880d681SAndroid Build Coastguard Worker !nvvm.annotations = !{!0} 851*9880d681SAndroid Build Coastguard Worker !0 = !{void (float addrspace(1)*, 852*9880d681SAndroid Build Coastguard Worker float addrspace(1)*, 853*9880d681SAndroid Build Coastguard Worker float addrspace(1)*)* @kernel, !"kernel", i32 1} 854*9880d681SAndroid Build Coastguard Worker 855*9880d681SAndroid Build Coastguard Worker 856*9880d681SAndroid Build Coastguard WorkerTo compile this kernel, we perform the following steps: 857*9880d681SAndroid Build Coastguard Worker 858*9880d681SAndroid Build Coastguard Worker1. Link with libdevice 859*9880d681SAndroid Build Coastguard Worker2. Internalize all but the public kernel function 860*9880d681SAndroid Build Coastguard Worker3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0 861*9880d681SAndroid Build Coastguard Worker4. Optimize the linked module 862*9880d681SAndroid Build Coastguard Worker5. Codegen the module 863*9880d681SAndroid Build Coastguard Worker 864*9880d681SAndroid Build Coastguard Worker 865*9880d681SAndroid Build Coastguard WorkerThese steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc`` 866*9880d681SAndroid Build Coastguard Workertools. In a complete compiler, these steps can also be performed entirely 867*9880d681SAndroid Build Coastguard Workerprogrammatically by setting up an appropriate pass configuration (see 868*9880d681SAndroid Build Coastguard Worker:ref:`libdevice`). 869*9880d681SAndroid Build Coastguard Worker 870*9880d681SAndroid Build Coastguard Worker.. code-block:: text 871*9880d681SAndroid Build Coastguard Worker 872*9880d681SAndroid Build Coastguard Worker # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc 873*9880d681SAndroid Build Coastguard Worker # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc 874*9880d681SAndroid Build Coastguard Worker # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx 875*9880d681SAndroid Build Coastguard Worker 876*9880d681SAndroid Build Coastguard Worker.. note:: 877*9880d681SAndroid Build Coastguard Worker 878*9880d681SAndroid Build Coastguard Worker The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any 879*9880d681SAndroid Build Coastguard Worker undefined variables will default to zero. It is shown here for evaluation 880*9880d681SAndroid Build Coastguard Worker purposes. 881*9880d681SAndroid Build Coastguard Worker 882*9880d681SAndroid Build Coastguard Worker 883*9880d681SAndroid Build Coastguard WorkerThis gives us the following PTX (excerpt): 884*9880d681SAndroid Build Coastguard Worker 885*9880d681SAndroid Build Coastguard Worker.. code-block:: text 886*9880d681SAndroid Build Coastguard Worker 887*9880d681SAndroid Build Coastguard Worker // 888*9880d681SAndroid Build Coastguard Worker // Generated by LLVM NVPTX Back-End 889*9880d681SAndroid Build Coastguard Worker // 890*9880d681SAndroid Build Coastguard Worker 891*9880d681SAndroid Build Coastguard Worker .version 3.1 892*9880d681SAndroid Build Coastguard Worker .target sm_20 893*9880d681SAndroid Build Coastguard Worker .address_size 64 894*9880d681SAndroid Build Coastguard Worker 895*9880d681SAndroid Build Coastguard Worker // .globl kernel 896*9880d681SAndroid Build Coastguard Worker // @kernel 897*9880d681SAndroid Build Coastguard Worker .visible .entry kernel( 898*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_0, 899*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_1, 900*9880d681SAndroid Build Coastguard Worker .param .u64 kernel_param_2 901*9880d681SAndroid Build Coastguard Worker ) 902*9880d681SAndroid Build Coastguard Worker { 903*9880d681SAndroid Build Coastguard Worker .reg .pred %p<30>; 904*9880d681SAndroid Build Coastguard Worker .reg .f32 %f<111>; 905*9880d681SAndroid Build Coastguard Worker .reg .s32 %r<21>; 906*9880d681SAndroid Build Coastguard Worker .reg .s64 %rl<8>; 907*9880d681SAndroid Build Coastguard Worker 908*9880d681SAndroid Build Coastguard Worker // BB#0: // %entry 909*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl2, [kernel_param_0]; 910*9880d681SAndroid Build Coastguard Worker mov.u32 %r3, %tid.x; 911*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl3, [kernel_param_1]; 912*9880d681SAndroid Build Coastguard Worker mul.wide.s32 %rl4, %r3, 4; 913*9880d681SAndroid Build Coastguard Worker add.s64 %rl5, %rl2, %rl4; 914*9880d681SAndroid Build Coastguard Worker ld.param.u64 %rl6, [kernel_param_2]; 915*9880d681SAndroid Build Coastguard Worker add.s64 %rl7, %rl3, %rl4; 916*9880d681SAndroid Build Coastguard Worker add.s64 %rl1, %rl6, %rl4; 917*9880d681SAndroid Build Coastguard Worker ld.global.f32 %f1, [%rl5]; 918*9880d681SAndroid Build Coastguard Worker ld.global.f32 %f2, [%rl7]; 919*9880d681SAndroid Build Coastguard Worker setp.eq.f32 %p1, %f1, 0f3F800000; 920*9880d681SAndroid Build Coastguard Worker setp.eq.f32 %p2, %f2, 0f00000000; 921*9880d681SAndroid Build Coastguard Worker or.pred %p3, %p1, %p2; 922*9880d681SAndroid Build Coastguard Worker @%p3 bra BB0_1; 923*9880d681SAndroid Build Coastguard Worker bra.uni BB0_2; 924*9880d681SAndroid Build Coastguard Worker BB0_1: 925*9880d681SAndroid Build Coastguard Worker mov.f32 %f110, 0f3F800000; 926*9880d681SAndroid Build Coastguard Worker st.global.f32 [%rl1], %f110; 927*9880d681SAndroid Build Coastguard Worker ret; 928*9880d681SAndroid Build Coastguard Worker BB0_2: // %__nv_isnanf.exit.i 929*9880d681SAndroid Build Coastguard Worker abs.f32 %f4, %f1; 930*9880d681SAndroid Build Coastguard Worker setp.gtu.f32 %p4, %f4, 0f7F800000; 931*9880d681SAndroid Build Coastguard Worker @%p4 bra BB0_4; 932*9880d681SAndroid Build Coastguard Worker // BB#3: // %__nv_isnanf.exit5.i 933*9880d681SAndroid Build Coastguard Worker abs.f32 %f5, %f2; 934*9880d681SAndroid Build Coastguard Worker setp.le.f32 %p5, %f5, 0f7F800000; 935*9880d681SAndroid Build Coastguard Worker @%p5 bra BB0_5; 936*9880d681SAndroid Build Coastguard Worker BB0_4: // %.critedge1.i 937*9880d681SAndroid Build Coastguard Worker add.f32 %f110, %f1, %f2; 938*9880d681SAndroid Build Coastguard Worker st.global.f32 [%rl1], %f110; 939*9880d681SAndroid Build Coastguard Worker ret; 940*9880d681SAndroid Build Coastguard Worker BB0_5: // %__nv_isinff.exit.i 941*9880d681SAndroid Build Coastguard Worker 942*9880d681SAndroid Build Coastguard Worker ... 943*9880d681SAndroid Build Coastguard Worker 944*9880d681SAndroid Build Coastguard Worker BB0_26: // %__nv_truncf.exit.i.i.i.i.i 945*9880d681SAndroid Build Coastguard Worker mul.f32 %f90, %f107, 0f3FB8AA3B; 946*9880d681SAndroid Build Coastguard Worker cvt.rzi.f32.f32 %f91, %f90; 947*9880d681SAndroid Build Coastguard Worker mov.f32 %f92, 0fBF317200; 948*9880d681SAndroid Build Coastguard Worker fma.rn.f32 %f93, %f91, %f92, %f107; 949*9880d681SAndroid Build Coastguard Worker mov.f32 %f94, 0fB5BFBE8E; 950*9880d681SAndroid Build Coastguard Worker fma.rn.f32 %f95, %f91, %f94, %f93; 951*9880d681SAndroid Build Coastguard Worker mul.f32 %f89, %f95, 0f3FB8AA3B; 952*9880d681SAndroid Build Coastguard Worker // inline asm 953*9880d681SAndroid Build Coastguard Worker ex2.approx.ftz.f32 %f88,%f89; 954*9880d681SAndroid Build Coastguard Worker // inline asm 955*9880d681SAndroid Build Coastguard Worker add.f32 %f96, %f91, 0f00000000; 956*9880d681SAndroid Build Coastguard Worker ex2.approx.f32 %f97, %f96; 957*9880d681SAndroid Build Coastguard Worker mul.f32 %f98, %f88, %f97; 958*9880d681SAndroid Build Coastguard Worker setp.lt.f32 %p15, %f107, 0fC2D20000; 959*9880d681SAndroid Build Coastguard Worker selp.f32 %f99, 0f00000000, %f98, %p15; 960*9880d681SAndroid Build Coastguard Worker setp.gt.f32 %p16, %f107, 0f42D20000; 961*9880d681SAndroid Build Coastguard Worker selp.f32 %f110, 0f7F800000, %f99, %p16; 962*9880d681SAndroid Build Coastguard Worker setp.eq.f32 %p17, %f110, 0f7F800000; 963*9880d681SAndroid Build Coastguard Worker @%p17 bra BB0_28; 964*9880d681SAndroid Build Coastguard Worker // BB#27: 965*9880d681SAndroid Build Coastguard Worker fma.rn.f32 %f110, %f110, %f108, %f110; 966*9880d681SAndroid Build Coastguard Worker BB0_28: // %__internal_accurate_powf.exit.i 967*9880d681SAndroid Build Coastguard Worker setp.lt.f32 %p18, %f1, 0f00000000; 968*9880d681SAndroid Build Coastguard Worker setp.eq.f32 %p19, %f3, 0f3F800000; 969*9880d681SAndroid Build Coastguard Worker and.pred %p20, %p18, %p19; 970*9880d681SAndroid Build Coastguard Worker @!%p20 bra BB0_30; 971*9880d681SAndroid Build Coastguard Worker bra.uni BB0_29; 972*9880d681SAndroid Build Coastguard Worker BB0_29: 973*9880d681SAndroid Build Coastguard Worker mov.b32 %r9, %f110; 974*9880d681SAndroid Build Coastguard Worker xor.b32 %r10, %r9, -2147483648; 975*9880d681SAndroid Build Coastguard Worker mov.b32 %f110, %r10; 976*9880d681SAndroid Build Coastguard Worker BB0_30: // %__nv_powf.exit 977*9880d681SAndroid Build Coastguard Worker st.global.f32 [%rl1], %f110; 978*9880d681SAndroid Build Coastguard Worker ret; 979*9880d681SAndroid Build Coastguard Worker } 980*9880d681SAndroid Build Coastguard Worker 981