
The most important things to know:

**Don't add a kernel to this folder unless you want it to be
compiled multiple times for different instruction sets.**  Yes,
this folder is named `cpu`, but that doesn't mean you should put any old
CPU kernel in it.  Only add CPU kernels that need to be compiled
multiple times to take advantage of AVX512/AVX2/SSE instructions on
the processors that support them.

**Ensure that all implementations in this folder are put in an
anonymous namespace.**  The files in this folder are compiled multiple
times with different headers. It's important that these functions have
internal linkage so that kernels for different architectures don't get
combined during linking.  It's sufficient to label functions `static`,
but class methods must be in an unnamed namespace to have internal linkage
(since `static` means something different in the context of classes).
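
As a minimal sketch of the two ways to get internal linkage described above (the names `add_one`, `Accumulator`, and `example` are hypothetical and do not come from this folder):

```cpp
#include <cassert>

// A free function gets internal linkage from `static`.
static int add_one(int x) {
  return x + 1;
}

// A class and its member functions get internal linkage from an unnamed
// namespace. Marking a member function `static` would only remove its
// `this` pointer; it would not change linkage.
namespace {

class Accumulator {
 public:
  void add(int x) { total_ += x; }
  int total() const { return total_; }

 private:
  int total_ = 0;
};

}  // namespace

// Usage: both definitions are visible only inside this translation unit.
static void example() {
  Accumulator acc;
  acc.add(2);
  acc.add(3);
  assert(add_one(acc.total()) == 6);
}
```

If the same file is compiled several times, each resulting object file keeps its own private copy of `add_one` and `Accumulator`, so the linker never has to choose between them.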

**The basic recipe is to define your kernel, and then register
it using DECLARE/REGISTER DISPATCH.**  Writing a kernel requires
four steps:

1. Declare your dispatch in a header file using
   `DECLARE_DISPATCH(fn_type, fnNameImpl);`
   where `fn_type` is the function pointer type of the kernel (e.g.,
   defined as `using fn_type = void(*)(Tensor&, const Tensor&)`)
   and `fnNameImpl` is the name of your dispatch registry.
   (It doesn't really matter where you put this declaration.)

2. Define your dispatch in a C++ file that is NOT in the cpu
   directory (dispatch must be defined exactly once) using
   `DEFINE_DISPATCH(fnNameImpl)` (matching the name of your declaration).
   Include the header file that declares the dispatch in this C++
   file.  Conventionally, we define the dispatch in the same file
   we will define our native function in.

3. Define a native function which calls into the dispatch using
   `fnNameImpl(kCPU, arguments...)`, where the arguments are
   the arguments according to the `fn_type` you defined in the
   declaration.

4. Write your actual kernel (e.g., `your_kernel`) in the
   cpu directory, and register it to the dispatch. If the kernel does
   not perform as well with AVX512 as it does with AVX2, use
   `REGISTER_DISPATCH(fnNameImpl, &your_kernel)`. If it performs well
   with AVX512, use `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`
   instead. Compute-intensive kernels tend to perform better with AVX512
   than with AVX2. To compare the AVX2 and AVX512 variants of a kernel,
   register it with `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`,
   build from source, and run a benchmarking script once with the
   environment variable `ATEN_CPU_CAPABILITY=avx2` and once with
   `ATEN_CPU_CAPABILITY=avx512`. Preloading tcmalloc or jemalloc can
   reduce run-to-run variation.
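
The four steps above can be sketched with a toy registry. This is not ATen's implementation: `ToyDispatchStub`, `ToyCapability`, `scale`, `scale_stub`, and `scale_kernel` are all hypothetical names, the real macros do considerably more, and the real call takes a device type such as `kCPU` rather than a capability. The sketch only shows how declaration, definition, call site, and registration relate to each other.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy capability levels, ordered from least to most capable.
enum class ToyCapability { DEFAULT = 0, AVX2 = 1, NUM_OPTIONS = 2 };

// Toy registry: one function pointer slot per capability level.
template <typename FnPtr>
struct ToyDispatchStub {
  FnPtr table[static_cast<int>(ToyCapability::NUM_OPTIONS)] = {};

  void register_kernel(ToyCapability cap, FnPtr fn) {
    table[static_cast<int>(cap)] = fn;
  }

  // Call the most capable registered kernel at or below `best_supported`.
  template <typename... Args>
  void operator()(ToyCapability best_supported, Args&&... args) const {
    for (int i = static_cast<int>(best_supported); i >= 0; --i) {
      if (table[i] != nullptr) {
        table[i](std::forward<Args>(args)...);
        return;
      }
    }
    assert(false && "no kernel registered for this dispatch");
  }
};

// Step 1 (header): declare the kernel signature and the dispatch registry.
using scale_fn = void (*)(std::vector<float>&, float);
extern ToyDispatchStub<scale_fn> scale_stub;

// Step 2 (one file outside cpu/): define the registry exactly once.
ToyDispatchStub<scale_fn> scale_stub;

// Step 3 (same file as step 2): the native function calls into the registry.
// The capability is hard-coded here; real code detects it at runtime.
std::vector<float> scale(std::vector<float> data, float factor) {
  scale_stub(ToyCapability::AVX2, data, factor);
  return data;
}

// Step 4 (a file inside cpu/): the kernel, kept in an unnamed namespace,
// plus a registration that runs during static initialization.
namespace {

void scale_kernel(std::vector<float>& data, float factor) {
  for (float& v : data) {
    v *= factor;
  }
}

const bool scale_kernel_registered =
    (scale_stub.register_kernel(ToyCapability::DEFAULT, &scale_kernel), true);

}  // namespace
```

In this sketch only a `DEFAULT` kernel is registered, so a call that asks for `AVX2` falls back to it. Everything lives in one file here for brevity, whereas the recipe above spreads the steps across a header, a file outside `cpu`, and a file inside it.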

There are plenty of existing examples; look at them for more details.

----

TODO: Clarify and add more documentation all around.

All of the `*.cpp` files in this folder will be compiled under all compiler
flags specified by `CPU_CAPABILITY_FLAGS` in `aten/src/ATen/CMakeLists.txt`.

The purpose of this is to allow the compilation with various compiler
flags to enable features such as AVX2 or AVX512 instructions, while using
runtime dispatch, which makes sure only valid instructions will be used on any
given platform.

`vec.h` provides a generic implementation of a `vec` type that allows
the programmer to write code that packs various primitives (such as floats)
into 256-bit and 512-bit registers. `vec` defines operators such as
`+` and `*` and provides functions for operations such as max and min.
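
To illustrate the idea of a packed vector type with operators, here is a toy sketch. `ToyVec8f`, `loadu`, `store`, and `maximum` are hypothetical names chosen for this example, and the bodies are plain scalar loops; the actual type in `vec.h` is assumed to map such operations onto SIMD instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy stand-in for a 256-bit register holding 8 floats.
struct ToyVec8f {
  static constexpr std::size_t kSize = 8;
  float values[kSize] = {};

  // Load kSize floats from (possibly unaligned) memory.
  static ToyVec8f loadu(const float* ptr) {
    assert(ptr != nullptr);
    ToyVec8f v;
    std::copy(ptr, ptr + kSize, v.values);
    return v;
  }

  // Write all kSize lanes back to memory.
  void store(float* ptr) const {
    assert(ptr != nullptr);
    std::copy(values, values + kSize, ptr);
  }
};

// Lane-wise addition.
inline ToyVec8f operator+(const ToyVec8f& a, const ToyVec8f& b) {
  ToyVec8f out;
  for (std::size_t i = 0; i < ToyVec8f::kSize; ++i) {
    out.values[i] = a.values[i] + b.values[i];
  }
  return out;
}

// Lane-wise multiplication.
inline ToyVec8f operator*(const ToyVec8f& a, const ToyVec8f& b) {
  ToyVec8f out;
  for (std::size_t i = 0; i < ToyVec8f::kSize; ++i) {
    out.values[i] = a.values[i] * b.values[i];
  }
  return out;
}

// Lane-wise maximum.
inline ToyVec8f maximum(const ToyVec8f& a, const ToyVec8f& b) {
  ToyVec8f out;
  for (std::size_t i = 0; i < ToyVec8f::kSize; ++i) {
    out.values[i] = std::max(a.values[i], b.values[i]);
  }
  return out;
}
```

Kernel code written against such a type reads like scalar arithmetic (`a + b`, `a * b`) while each operation processes a whole register's worth of elements.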

As an example, `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces
an entire array using a given associative binary operation such as `+`.

More explicitly, calling `kernel_` with template argument `std::plus` will cause
it to sum up the entire array into a single value.
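
A reduction parameterized on a binary operation might look roughly like the sketch below. `reduce_all` is a hypothetical name and signature, not the actual `kernel_`, and the loop is scalar rather than vectorized.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Fold `n` elements into one value using the binary operation `Op<T>`.
// `identity` must be the neutral element of the operation
// (0 for addition, 1 for multiplication).
template <template <typename> class Op, typename T>
T reduce_all(const T* data, std::size_t n, T identity) {
  assert(data != nullptr || n == 0);
  T acc = identity;
  for (std::size_t i = 0; i < n; ++i) {
    acc = Op<T>{}(acc, data[i]);
  }
  return acc;
}
```

With this sketch, `reduce_all<std::plus>(xs, n, 0.0f)` sums the array and `reduce_all<std::multiplies>(xs, n, 1.0f)` multiplies its elements. Associativity matters because a vectorized version is free to combine elements in a different order.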

`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
compiler flags it is currently compiled. This allows the programmer to write
generic code, which will be compiled under multiple compilation settings.
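
A minimal sketch of code that branches on these macros follows. It assumes the build system defines at most one of `CPU_CAPABILITY_AVX512` or `CPU_CAPABILITY_AVX2` per compilation; the function `compiled_capability` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Report which capability this translation unit was compiled for.
// When neither macro is defined, the default branch is taken.
static std::string compiled_capability() {
#if defined(CPU_CAPABILITY_AVX512)
  return "avx512";
#elif defined(CPU_CAPABILITY_AVX2)
  return "avx2";
#else
  return "default";
#endif
}

// Usage: exactly one branch is compiled in, so the name is never empty.
static void example() {
  assert(!compiled_capability().empty());
}
```

The same source file therefore yields a different function body in each compilation, which is why the resulting symbols must not collide at link time.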

`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
a generic definition of `sumImplAll`. This function allows the user to reduce
over a dimension or all dimensions. The appropriate capability is chosen at
runtime using cpuinfo. If the current platform has AVX2, `sumImpl` will be set
to `sumImplAll<CPUCapability::AVX2>`.
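
The runtime selection described above can be sketched as picking one template instantiation through a function pointer. In this sketch `sum_impl_all`, `sum_fn`, and `select_sum_impl` are hypothetical names, and the `has_avx2` flag stands in for a cpuinfo query.

```cpp
#include <cassert>
#include <cstddef>

enum class CPUCapability { DEFAULT, AVX2 };

// One instantiation per capability. In a real build each instantiation
// would come from a compilation with different flags; here both bodies
// are the same scalar loop.
template <CPUCapability C>
float sum_impl_all(const float* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    total += data[i];
  }
  return total;
}

using sum_fn = float (*)(const float*, std::size_t);

// Choose the implementation once, based on what the CPU supports.
sum_fn select_sum_impl(bool has_avx2) {
  return has_avx2 ? &sum_impl_all<CPUCapability::AVX2>
                  : &sum_impl_all<CPUCapability::DEFAULT>;
}
```

Callers then go through the returned pointer, so the capability check happens once rather than on every call.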

At runtime, the following environment variables control which codepath is taken:

x64 options:

- `ATEN_CPU_CAPABILITY=avx2`: force AVX2 codepaths to be used
- `ATEN_CPU_CAPABILITY=avx`: force AVX codepaths to be used
- `ATEN_CPU_CAPABILITY=default`: use the oldest supported vector instruction set
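
As a rough sketch of how such an override could be read, the code below maps the variable's value to a capability. `ToyCapability`, `parse_capability`, and `capability_from_env` are hypothetical names, and keeping the detected capability for a missing or unrecognized value is an assumption of this sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

enum class ToyCapability { DEFAULT, AVX, AVX2 };

// Map an ATEN_CPU_CAPABILITY-style string to a capability. A null or
// unrecognized value keeps `detected` (an assumption of this sketch).
ToyCapability parse_capability(const char* value, ToyCapability detected) {
  if (value == nullptr) return detected;
  if (std::strcmp(value, "avx2") == 0) return ToyCapability::AVX2;
  if (std::strcmp(value, "avx") == 0) return ToyCapability::AVX;
  if (std::strcmp(value, "default") == 0) return ToyCapability::DEFAULT;
  return detected;
}

// Read the override from the environment, falling back to `detected`.
ToyCapability capability_from_env(ToyCapability detected) {
  return parse_capability(std::getenv("ATEN_CPU_CAPABILITY"), detected);
}

// Usage: an explicit value wins over detection; no value keeps detection.
static void example() {
  assert(parse_capability("avx2", ToyCapability::DEFAULT) == ToyCapability::AVX2);
  assert(parse_capability(nullptr, ToyCapability::AVX) == ToyCapability::AVX);
}
```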