.. _numerical_accuracy:

Numerical accuracy
==================

In modern computers, floating point numbers are represented using the IEEE 754 standard.
For more details on floating point arithmetic and the IEEE 754 standard, please see
`Floating point arithmetic <https://en.wikipedia.org/wiki/Floating-point_arithmetic>`_.
In particular, note that floating point provides limited accuracy (about 7 decimal digits
for single precision floating point numbers, about 16 decimal digits for double precision
floating point numbers) and that floating point addition and multiplication are not
associative, so the order of the operations affects the results.
Because of this, PyTorch is not guaranteed
to produce bitwise identical results for floating point computations that are
mathematically identical. Similarly, bitwise identical results are not guaranteed across
PyTorch releases, individual commits, or different platforms. In particular, CPU and GPU
results can differ even for bitwise-identical inputs and even after controlling for
the sources of randomness.

Batched computations or slice computations
------------------------------------------

Many operations in PyTorch support batched computation, where the same operation is performed
for each element of a batch of inputs. Examples of this are :meth:`torch.mm` and
:meth:`torch.bmm`. It is possible to implement batched computation as a loop over the batch
elements, applying the necessary math operations to each element individually, but for
efficiency reasons we do not do that and typically perform the computation for the whole batch.
The mathematical libraries that we call, and PyTorch's internal implementations of operations,
can produce slightly different results in this case compared to non-batched computations. In
particular, let ``A`` and ``B`` be 3D tensors with the dimensions suitable for batched matrix
multiplication. Then ``(A@B)[0]`` (the first element of the batched result) is not guaranteed
to be bitwise identical to ``A[0]@B[0]`` (the matrix product of the first elements of the input
batches) even though mathematically it's an identical computation.

Similarly, an operation applied to a tensor slice is not guaranteed to produce results that are
identical to the slice of the result of the same operation applied to the full tensor. For example,
let ``A`` be a 2-dimensional tensor. ``A.sum(-1)[0]`` is not guaranteed to be bitwise equal to
``A[0].sum()``.


Extremal values
---------------

When inputs contain large values such that intermediate results may overflow the range of the
used datatype, the end result may overflow too, even though the exact result is representable
in the original datatype. For example:

.. code:: python

    import torch
    a = torch.tensor([1e20, 1e20])  # fp32 type by default
    a.norm()  # produces tensor(inf)
    a.double().norm()  # produces tensor(1.4142e+20, dtype=torch.float64), representable in fp32

.. _Linear Algebra Stability:

Linear algebra (``torch.linalg``)
---------------------------------

Non-finite values
"""""""""""""""""

The external libraries (backends) that ``torch.linalg`` uses provide no guarantees on their behavior
when the inputs have non-finite values like ``inf`` or ``NaN``. As such, neither does PyTorch.
The operations may return a tensor with non-finite values, raise an exception, or even segfault.

Consider using :func:`torch.isfinite` before calling these functions to detect this situation.
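A minimal sketch of such a check (the input values and the error handling below are purely
illustrative) might look like:

.. code:: python

    import torch

    A = torch.tensor([[1.0, 2.0], [3.0, float("inf")]])  # deliberately contains a non-finite entry
    b = torch.tensor([1.0, 1.0])

    # With non-finite entries the backend's behavior is undefined, so validate the inputs first.
    if torch.isfinite(A).all() and torch.isfinite(b).all():
        x = torch.linalg.solve(A, b)
    else:
        raise ValueError("input to torch.linalg.solve contains inf or NaN")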
Extremal values in linalg
"""""""""""""""""""""""""

Functions within ``torch.linalg`` are more sensitive to `Extremal Values`_ than other PyTorch functions.

:ref:`linalg solvers` and :ref:`linalg inverses` assume that the input matrix ``A`` is invertible. If it is close to
being non-invertible (for example, if it has a very small singular value), then these algorithms may silently return
incorrect results. These matrices are said to be `ill-conditioned <https://nhigham.com/2020/03/19/what-is-a-condition-number/>`_.
If provided with ill-conditioned inputs, the results of these functions may vary when using the same inputs on different
devices or when using different backends via the keyword ``driver``.

Spectral operations like ``svd``, ``eig``, and ``eigh`` may also return incorrect results (and their gradients may be infinite)
when their inputs have singular values that are close to each other. This is because the algorithms used to compute these
decompositions struggle to converge for such inputs.

Running the computation in ``float64`` (as NumPy does by default) often helps, but it does not solve these issues in all cases.
Analyzing the spectrum of the inputs via :func:`torch.linalg.svdvals` or their condition number via :func:`torch.linalg.cond`
may help to detect these issues.


TensorFloat-32 (TF32) on Nvidia Ampere (and later) devices
------------------------------------------------------------

On Ampere (and later) Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may produce results that are different from the input).
By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads have the same convergence behavior when using TF32 as they have with fp32.
We recommend enabling TF32 tensor cores for matrix multiplications with ``torch.backends.cuda.matmul.allow_tf32 = True`` if your network does not need full float32 precision.
If your network needs full float32 precision for both matrix multiplications and convolutions, then TF32 tensor cores can also be disabled for convolutions with ``torch.backends.cudnn.allow_tf32 = False``.

For more information see :ref:`TensorFloat32<tf32_on_ampere>`.

Reduced Precision Reduction for FP16 and BF16 GEMMs
----------------------------------------------------

Half-precision GEMM operations are typically done with intermediate accumulations (reductions) in single precision for numerical accuracy and improved resilience to overflow. For performance, certain GPU architectures, especially more recent ones, allow truncating the intermediate accumulation results to the reduced precision (e.g., half precision). This change is often benign from the perspective of model convergence, though it may lead to unexpected results (e.g., ``inf`` values when the final result should be representable in half precision).

If reduced-precision reductions are problematic, they can be turned off with
``torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False``.

A similar flag exists for BF16 GEMM operations and is turned on by default. If BF16
reduced-precision reductions are problematic, they can be turned off with
``torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False``.

For more information see :ref:`allow_fp16_reduced_precision_reduction<fp16reducedprecision>` and :ref:`allow_bf16_reduced_precision_reduction<bf16reducedprecision>`.
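As an illustrative sketch only (it assumes a CUDA device is available, and whether the two
results actually differ depends on the GPU architecture and the backend in use), the FP16 flag
can be toggled around a half-precision matmul like this:

.. code:: python

    import torch

    # Arbitrary half-precision matrices on the GPU.
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

    # Default: the backend may truncate intermediate accumulations to FP16.
    out_default = a @ b

    # Request full FP32 accumulation for FP16 GEMMs (the BF16 flag works analogously).
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    out_full_precision = a @ b

    # The two results are not guaranteed to be bitwise identical.
    print((out_default.float() - out_full_precision.float()).abs().max())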
Reduced Precision Reduction for FP16 and BF16 in Scaled Dot Product Attention (SDPA)
---------------------------------------------------------------------------------------

A naive SDPA math backend, when using FP16/BF16 inputs, can accumulate significant numerical errors due to the usage of low-precision intermediate buffers. To mitigate this issue, the default behavior now involves upcasting FP16/BF16 inputs to FP32. Computations are performed in FP32/TF32, and the final FP32 results are then downcast back to FP16/BF16. This improves the numerical accuracy of the final output for the math backend with FP16/BF16 inputs, but increases memory usage and may cause performance regressions in the math backend as computations shift from FP16/BF16 BMM to FP32/TF32 BMM/matmul.

For scenarios where reduced-precision reductions are preferred for speed, they can be enabled with the following setting:
``torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)``

.. _fp16_on_mi200:

Reduced Precision FP16 and BF16 GEMMs and Convolutions on AMD Instinct MI200 devices
---------------------------------------------------------------------------------------

On AMD Instinct MI200 GPUs, the FP16 and BF16 V_DOT2 and MFMA matrix instructions flush input and output denormal values to zero. FP32 and FP64 MFMA matrix instructions do not flush input and output denormal values to zero. The affected instructions are only used by rocBLAS (GEMM) and MIOpen (convolution) kernels; all other PyTorch operations will not encounter this behavior. All other supported AMD GPUs will not encounter this behavior.

rocBLAS and MIOpen provide alternate implementations for affected FP16 operations. Alternate implementations for BF16 operations are not provided; BF16 numbers have a larger dynamic range than FP16 numbers and are less likely to encounter denormal values. For the FP16 alternate implementations, FP16 input values are cast to an intermediate BF16 value and then cast back to FP16 output after the FP32 accumulate operations. In this way, the input and output types are unchanged.

When training using FP16 precision, some models may fail to converge with FP16 denorms flushed to zero. Denormal values more frequently occur in the backward pass of training during gradient calculation. PyTorch by default will use the rocBLAS and MIOpen alternate implementations during the backward pass. The default behavior can be overridden using the environment variables ``ROCBLAS_INTERNAL_FP16_ALT_IMPL`` and ``MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL``. The behavior of these environment variables is as follows:

+---------------+-----------+-----------+
|               | forward   | backward  |
+===============+===========+===========+
| Env unset     | original  | alternate |
+---------------+-----------+-----------+
| Env set to 1  | alternate | alternate |
+---------------+-----------+-----------+
| Env set to 0  | original  | original  |
+---------------+-----------+-----------+
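For example (a sketch only; these variables are more commonly exported in the shell that
launches the job), setting them from Python before ``torch`` is imported keeps the values
visible whenever the backends read them:

.. code:: python

    import os

    # "1" selects the alternate implementation for the forward pass as well (see the
    # table above); "0" selects the original implementation everywhere.
    os.environ["ROCBLAS_INTERNAL_FP16_ALT_IMPL"] = "1"
    os.environ["MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL"] = "1"

    import torch  # imported after the environment is configured

    # FP16 training on an MI200 device proceeds as usual from here.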
The following is the list of operations where rocBLAS may be used:

* torch.addbmm
* torch.addmm
* torch.baddbmm
* torch.bmm
* torch.mm
* torch.nn.GRUCell
* torch.nn.LSTMCell
* torch.nn.Linear
* torch.sparse.addmm
* the following torch._C._ConvBackend implementations:

  * slowNd
  * slowNd_transposed
  * slowNd_dilated
  * slowNd_dilated_transposed

The following is the list of operations where MIOpen may be used:

* torch.nn.Conv[Transpose]Nd
* the following torch._C._ConvBackend implementations:

  * ConvBackend::Miopen
  * ConvBackend::MiopenDepthwise
  * ConvBackend::MiopenTranspose