Platform: Portable Computing Language
  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 3.0-rc2 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1301.28
      float2  : 1369.03
      float4  : 1406.91
      float8  : 1438.37
      float16 : 1460.08

    Single-precision compute (GFLOPS)
      float   : 19402.00
      float2  : 19361.56
      float4  : 19360.86
      float8  : 19281.99
      float16 : 19139.73

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9718.42
      double2  : 9697.19
      double4  : 9686.17
      double8  : 9653.11
      double16 : 9576.27

    Integer compute (GIOPS)
      int   : 19318.55
      int2  : 19315.23
      int4  : 19360.05
      int8  : 19316.09
      int16 : 19305.90

    Integer compute Fast 24bit (GIOPS)
      int   : 19322.74
      int2  : 19319.41
      int4  : 19333.47
      int8  : 19316.84
      int16 : 19306.22

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 20.22
      enqueueReadBuffer               : 7.93
      enqueueWriteBuffer non-blocking : 20.21
      enqueueReadBuffer non-blocking  : 7.92
      enqueueMapBuffer(for read)      : 141281.83
        memcpy from mapped ptr        : 20.48
      enqueueUnmap(after write)       : 15.90
        memcpy to mapped ptr          : 20.23

    Kernel launch latency : 7195.83 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 3.0-rc2 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1298.47
      float2  : 1368.92
      float4  : 1406.60
      float8  : 1439.31
      float16 : 1460.02

    Single-precision compute (GFLOPS)
      float   : 19388.10
      float2  : 19356.01
      float4  : 19356.55
      float8  : 19277.93
      float16 : 19135.15

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9713.43
      double2  : 9692.54
      double4  : 9680.89
      double8  : 9647.49
      double16 : 9570.05

    Integer compute (GIOPS)
      int   : 19316.41
      int2  : 19339.49
      int4  : 19328.43
      int8  : 19311.48
      int16 : 19300.44

    Integer compute Fast 24bit (GIOPS)
      int   : 19317.16
      int2  : 19313.40
      int4  : 19327.89
      int8  : 19311.15
      int16 : 19299.80

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.44
      enqueueReadBuffer               : 13.10
      enqueueWriteBuffer non-blocking : 14.41
      enqueueReadBuffer non-blocking  : 13.10
      enqueueMapBuffer(for read)      : 26.35
        memcpy from mapped ptr        : 19.53
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.62

    Kernel launch latency : 9458.67 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 3.0-rc2 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1299.52
      float2  : 1369.10
      float4  : 1406.73
      float8  : 1440.49
      float16 : 1460.83

    Single-precision compute (GFLOPS)
      float   : 19401.13
      float2  : 19356.17
      float4  : 19356.55
      float8  : 19277.87
      float16 : 19135.10

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9714.25
      double2  : 9693.57
      double4  : 9682.23
      double8  : 9647.81
      double16 : 9571.95

    Integer compute (GIOPS)
      int   : 19317.69
      int2  : 19341.86
      int4  : 19328.53
      int8  : 19312.01
      int16 : 19301.08

    Integer compute Fast 24bit (GIOPS)
      int   : 19317.91
      int2  : 19314.69
      int4  : 19328.53
      int8  : 19311.80
      int16 : 19300.76

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.53
      enqueueReadBuffer               : 9.13
      enqueueWriteBuffer non-blocking : 14.44
      enqueueReadBuffer non-blocking  : 9.12
      enqueueMapBuffer(for read)      : 26.35
        memcpy from mapped ptr        : 19.40
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.62

    Kernel launch latency : 11937.56 us

  Device: NVIDIA A100-SXM4-40GB
    Driver version  : 3.0-rc2 (Linux x64)
    Compute units   : 108
    Clock frequency : 1410 MHz

    Global memory bandwidth (GBPS)
      float   : 1304.24
      float2  : 1369.08
      float4  : 1406.75
      float8  : 1439.62
      float16 : 1460.71

    Single-precision compute (GFLOPS)
      float   : 19393.56
      float2  : 19365.28
      float4  : 19365.01
      float8  : 19286.58
      float16 : 19144.05

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 9720.38
      double2  : 9699.67
      double4  : 9688.97
      double8  : 9655.90
      double16 : 9580.43

    Integer compute (GIOPS)
      int   : 19324.88
      int2  : 19321.23
      int4  : 19366.62
      int8  : 19321.13
      int16 : 19310.40

    Integer compute Fast 24bit (GIOPS)
      int   : 19327.03
      int2  : 19323.49
      int4  : 19337.24
      int8  : 19320.91
      int16 : 19310.19

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 14.41
      enqueueReadBuffer               : 6.99
      enqueueWriteBuffer non-blocking : 14.38
      enqueueReadBuffer non-blocking  : 7.00
      enqueueMapBuffer(for read)      : 25.94
        memcpy from mapped ptr        : 20.83
      enqueueUnmap(after write)       : 26.77
        memcpy to mapped ptr          : 20.56

    Kernel launch latency : 15067.95 us