The most important things to know:

**Don't add a kernel to this folder unless you want it to be
compiled multiple times for different instruction sets.**  Yes,
this folder is named `cpu`, but that doesn't mean any old CPU
kernel belongs in it.  Only put CPU kernels here that need to be
compiled multiple times to take advantage of AVX512/AVX2/SSE
instructions on processors that support them.

**Ensure that all implementations in this folder are put in an
anonymous namespace.**  The files in this folder are compiled multiple
times with different headers. It's important that these functions have
internal linkage so that kernels for different architectures don't get
combined during linking.  It's sufficient to label free functions
`static`, but class methods must be in an anonymous namespace to have
internal linkage (since `static` means something different in the
context of classes).
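
The two linkage techniques look like this. All names below are made up
for illustration; only the pattern (a `static` free function, a class
wrapped in an anonymous namespace) reflects the rule above:

```cpp
// Hypothetical illustration of the internal-linkage rule above.
#include <cassert>

// Free functions can simply be marked `static`:
static float scale_kernel(float x) { return x * 2.0f; }

// Class methods can't get internal linkage via `static`, so wrap
// the whole class in an anonymous namespace instead:
namespace {
struct AddKernel {
  float run(float a, float b) const { return a + b; }
};
} // anonymous namespace

float demo_linkage() {
  AddKernel k;
  return k.run(scale_kernel(1.5f), 1.0f); // 1.5 * 2 + 1 = 4.0
}
```

With both symbols kept internal, two object files compiled from the same
source under different `CPU_CAPABILITY` flags can be linked together
without the definitions colliding or being deduplicated.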

**The basic recipe is to define your kernel, and then register
it using DECLARE/REGISTER DISPATCH.**  Writing a kernel requires
four steps:

1. Declare your dispatch in a header file using
   `DECLARE_DISPATCH(fn_type, fnNameImpl);`
   where `fn_type` is the function pointer type of the kernel (e.g.,
   defined as `using fn_type = void(*)(Tensor&, const Tensor&);`)
   and `fnNameImpl` is the name of your dispatch registry.
   (It doesn't really matter where you put this declaration.)

2. Define your dispatch in a C++ file that is NOT in the cpu
   directory (the dispatch must be defined exactly once) using
   `DEFINE_DISPATCH(fnNameImpl);` (matching the name in your
   declaration).  Include the header file that declares the dispatch
   in this C++ file.  Conventionally, we define the dispatch in the
   same file where we define our native function.

3. Define a native function which calls into the dispatch using
   `fnNameImpl(kCPU, arguments...)`, where the arguments are
   the arguments according to the `fn_type` you defined in the
   declaration.

4. Write your actual kernel (e.g., `your_kernel`) in the
   cpu directory, and register it to the dispatch using
   `REGISTER_DISPATCH(fnNameImpl, &your_kernel);` if it does not
   perform as well with AVX512 as it does with AVX2.  Otherwise, if it
   performs well with AVX512, register it with
   `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel);`.
   Compute-intensive kernels tend to perform better with AVX512 than
   with AVX2.  To compare the AVX2 and AVX512 variants of a kernel,
   register it with `ALSO_REGISTER_AVX512_DISPATCH`, build from source,
   and benchmark the kernel twice, once with `ATEN_CPU_CAPABILITY=avx2`
   and once with `ATEN_CPU_CAPABILITY=avx512` set in the environment.
   tcmalloc/jemalloc can be preloaded for minimal run-to-run variation.

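The recipe above can be sketched with a toy stand-in for the real
machinery. Everything here is a simplified mock: the actual ATen macros
generate a `DispatchStub` with per-capability function pointers, and the
names `fn_type`, `fnNameImpl`, `native_fn`, and `scalar_kernel` are
invented for illustration:

```cpp
// Toy mock of the dispatch-stub pattern (NOT the real ATen macros).
#include <cassert>
#include <vector>

// Step 1: the "declaration" -- a function pointer type and a stub
// struct holding one pointer per CPU capability.
using fn_type = void (*)(std::vector<float>&);
struct DispatchStub {
  fn_type default_impl = nullptr;
  fn_type avx2_impl = nullptr;
};

// Step 2: the "definition" -- exactly one stub object.
DispatchStub fnNameImpl;

// Step 4: a kernel that would normally live in the cpu/ directory,
// registered into the stub by a global object's constructor.
static void scalar_kernel(std::vector<float>& v) {
  for (auto& x : v) x *= 2.0f;
}
struct RegisterScalar {
  RegisterScalar() { fnNameImpl.default_impl = &scalar_kernel; }
} register_scalar;

// Step 3: the "native function" picks an implementation at runtime
// (the real code consults cpuinfo; here we just fall back).
void native_fn(std::vector<float>& v) {
  fn_type fn = fnNameImpl.avx2_impl ? fnNameImpl.avx2_impl
                                    : fnNameImpl.default_impl;
  fn(v);
}
```

In the real build, each capability's translation unit registers its own
pointer into the same stub, and the stub resolves the best available one
the first time it is called.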
There are plenty of existing examples; look at them for more details.

----

TODO: Clarify and add more documentation all around.

All of the `*.cpp` files in this folder will be compiled under all compiler
flags specified by `CPU_CAPABILITY_FLAGS` in `aten/src/ATen/CMakeLists.txt`.

The purpose of this is to allow compilation with various compiler
flags to enable features such as AVX2 or AVX512 instructions, while using
runtime dispatch, which makes sure only valid instructions will be used on any
given platform.

`vec.h` provides a generic implementation of a vec type that allows
the programmer to write code packing various primitives (such as floats)
within 256-bit and 512-bit registers. vec defines various operators such as
`+` and `*` and provides functions for operations such as max, min, etc.
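
To illustrate the idea, here is a toy fixed-width vector type. This is
NOT the real `vec.h` interface (the actual type is `at::vec::Vectorized`
and compiles down to SIMD intrinsics); it only shows the shape of the
abstraction, with `Vec`, `loadu`, and `maximum` as invented names:

```cpp
// Toy fixed-width vector type illustrating the vec.h idea.
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
struct Vec {
  std::array<T, N> lanes{};

  // Load N elements from unaligned memory.
  static Vec loadu(const T* p) {
    Vec v;
    std::copy(p, p + N, v.lanes.begin());
    return v;
  }
  Vec operator+(const Vec& o) const {
    Vec r;
    for (std::size_t i = 0; i < N; ++i) r.lanes[i] = lanes[i] + o.lanes[i];
    return r;
  }
  Vec operator*(const Vec& o) const {
    Vec r;
    for (std::size_t i = 0; i < N; ++i) r.lanes[i] = lanes[i] * o.lanes[i];
    return r;
  }
};

// Lane-wise maximum of two vectors.
template <typename T, std::size_t N>
Vec<T, N> maximum(const Vec<T, N>& a, const Vec<T, N>& b) {
  Vec<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r.lanes[i] = std::max(a.lanes[i], b.lanes[i]);
  return r;
}
```

Code written against such a type can be recompiled with N = 8 floats
(256-bit) or N = 16 floats (512-bit) without changing the kernel logic.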

As an example, `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces
an entire array using a given associative binary operation such as +.

More explicitly, calling `kernel_` with template argument `std::plus` will cause
it to sum up the entire array into a single value.
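
A minimal scalar sketch of such a reduction (the real `kernel_` is
vectorized and operates on tensor iterators; `reduce_all` here is a
hypothetical simplification):

```cpp
// Simplified sketch of a reduction kernel parameterized by a binary op.
#include <cassert>
#include <cstddef>
#include <functional>

template <typename Op>
float reduce_all(const float* data, std::size_t n, Op op, float init) {
  float acc = init;
  for (std::size_t i = 0; i < n; ++i) acc = op(acc, data[i]);
  return acc;
}
```

Calling `reduce_all(data, n, std::plus<float>{}, 0.0f)` sums the array;
substituting a different associative op (e.g. a max functor with the
appropriate identity) reuses the same loop.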

`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
compiler flags it is currently compiled. This allows the programmer to write
generic code, which will be compiled under multiple compilation settings.
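
The pattern looks roughly like this. The `CPU_CAPABILITY_*` macro names
are the ones used in this folder, but `capability_name` is a
hypothetical helper; compiled outside the multi-capability build, none
of the macros are defined and the default branch is taken:

```cpp
// Sketch: selecting code at compile time based on which capability
// this translation unit is being built for.
#include <cassert>
#include <cstring>

const char* capability_name() {
#if defined(CPU_CAPABILITY_AVX512)
  return "avx512";
#elif defined(CPU_CAPABILITY_AVX2)
  return "avx2";
#else
  return "default";
#endif
}
```

Because the same source file is compiled once per entry in
`CPU_CAPABILITY_FLAGS`, each object file bakes in a different branch.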

`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
a generic definition of `sumImplAll`. This function allows the user to reduce
over a dimension or all dimensions. The appropriate capability is chosen at
runtime using cpuinfo. If the current platform has AVX2, `sumImpl` will be set
to `sumImplAll<CPUCapability::AVX2>`.

At runtime, the following environment variables control which codepath is taken:

x64 options:
ATEN_CPU_CAPABILITY=avx2    # Force AVX2 codepaths to be used
ATEN_CPU_CAPABILITY=avx     # Force AVX codepaths to be used
ATEN_CPU_CAPABILITY=default # Use oldest supported vector instruction set