The most important things to know:

**Don't add a kernel to this folder unless you want it to be
compiled multiple times for different instruction sets.** Yes,
this folder is named `cpu`, but that doesn't mean you should put any old
CPU kernel in it. Only put CPU kernels here that need to be compiled
multiple times to take advantage of AVX512/AVX2/SSE instructions, but
only on processors that support them.

**Ensure that all implementations in this folder are put in an
anonymous namespace.** The files in this folder are compiled multiple
times with different headers. It's important that these functions have
internal linkage so that kernels for different architectures don't get
combined during linking. It's sufficient to label functions `static`,
but class methods must be in an unnamed namespace to have internal linkage
(since `static` means something different in the context of classes).

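As a minimal sketch of the linkage rule above, both a free function and a class can be kept local to the translation unit by wrapping them in an unnamed namespace. The names `sum_kernel` and `Scaler` are hypothetical and are not taken from ATen.

```cpp
#include <cstddef>

namespace {

// Free function: the enclosing unnamed namespace gives it internal linkage.
// Marking it `static` at namespace scope would have the same effect.
float sum_kernel(const float* data, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    acc += data[i];
  }
  return acc;
}

// Class with a member function: `static` on a member would mean "no `this`",
// not internal linkage, so the unnamed namespace is what keeps it local.
struct Scaler {
  float factor;
  float apply(float x) const { return x * factor; }
};

} // namespace
```

Because both definitions are local to the translation unit, compiling the same file again with different flags should not produce clashing symbols at link time.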
**The basic recipe is to define your kernel, and then register
it using DECLARE/REGISTER DISPATCH.** Writing a kernel requires
four steps:

1. Declare your dispatch in a header file using
   `DECLARE_DISPATCH(fn_type, fnNameImpl);`
   where `fn_type` is the function pointer type of the kernel (e.g.,
   defined as `using fn_type = void(*)(Tensor&, const Tensor&)`)
   and `fnNameImpl` is the name of your dispatch registry.
   (It doesn't really matter where you put this declaration.)

2. Define your dispatch in a C++ file that is NOT in the cpu
   directory (dispatch must be defined exactly once) using
   `DEFINE_DISPATCH(fnNameImpl)` (matching the name of your declaration).
   Include the header file that declares the dispatch in this C++
   file. Conventionally, we define the dispatch in the same file
   we will define our native function in.

3. Define a native function which calls into the dispatch using
   `fnNameImpl(kCPU, arguments...)`, where the arguments are
   the arguments according to the `fn_type` you defined in the
   declaration.

4. Write your actual kernel (e.g., `your_kernel`) in the
   cpu directory, and register it to the dispatch using
   `REGISTER_DISPATCH(fnNameImpl, &your_kernel)` if it does not perform
   as well with AVX512 as it does with AVX2. Otherwise, if it performs
   well with AVX512, register it with
   `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`.
   Compute-intensive kernels tend to perform better with AVX512 than with AVX2.
   To compare the AVX2 and AVX512 variants of a kernel, register it with
   `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`, build from source,
   and run a benchmarking script twice: once with the environment variable
   `ATEN_CPU_CAPABILITY=avx2` and once with `ATEN_CPU_CAPABILITY=avx512`.
   Preloading tcmalloc or jemalloc can reduce run-to-run variation.

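To illustrate the shape of the four steps, here is a simplified, self-contained stand-in for a dispatch registry. It does not use the real ATen macros: `SumDispatch`, `sum_stub`, `Capability`, and `native_sum` are hypothetical names. The fallback from AVX512 to the AVX2 kernel is an assumption about how a kernel registered without an AVX512 variant might be resolved. In a real build the pieces live in separate files; here they share one file so the sketch compiles on its own.

```cpp
#include <cstddef>

// Hypothetical capability levels; not the ATen enum.
enum class Capability { Default, AVX2, AVX512 };

// Step 1 (declare): the function pointer type and the registry type.
using sum_fn = float (*)(const float*, std::size_t);

struct SumDispatch {
  sum_fn default_impl = nullptr;
  sum_fn avx2_impl = nullptr;
  sum_fn avx512_impl = nullptr;

  // Pick the best registered kernel that does not exceed `cap`.
  sum_fn choose(Capability cap) const {
    if (cap == Capability::AVX512 && avx512_impl) return avx512_impl;
    if (cap != Capability::Default && avx2_impl) return avx2_impl;
    return default_impl;
  }

  float operator()(Capability cap, const float* data, std::size_t n) const {
    return choose(cap)(data, n);
  }
};

// Step 2 (define, exactly once): the single registry instance.
SumDispatch sum_stub;

// Step 3 (native function): calls into the dispatch.
float native_sum(const float* data, std::size_t n, Capability cap) {
  return sum_stub(cap, data, n);
}

// Step 4 (kernel + registration): kernels have internal linkage.
namespace {

float sum_default(const float* data, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) acc += data[i];
  return acc;
}

// Stand-in for the same source compiled with AVX2 flags.
float sum_avx2(const float* data, std::size_t n) {
  return sum_default(data, n);
}

// Registers the kernels during static initialization.
// No AVX512 variant is registered, so AVX512 falls back to AVX2 here.
struct Registrar {
  Registrar() {
    sum_stub.default_impl = &sum_default;
    sum_stub.avx2_impl = &sum_avx2;
  }
} registrar;

} // namespace
```

Calling `native_sum(data, n, Capability::AVX512)` in this sketch resolves to `sum_avx2`, since no AVX512 kernel was registered.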
There are plenty of existing examples; look at them for more details.

----

TODO: Clarify and add more documentation all around.

All of the `*.cpp` files in this folder will be compiled under all compiler
flags specified by `CPU_CAPABILITY_FLAGS` in `aten/src/ATen/CMakeLists.txt`.

The purpose of this is to allow compilation with various compiler
flags to enable features such as AVX2 or AVX512 instructions, while using
runtime dispatch, which makes sure only valid instructions will be used on any
given platform.

`vec.h` provides a generic implementation of a `vec` type that allows
the programmer to write code packing various primitives (such as floats)
within 256-bit and 512-bit registers. `vec` defines various operators such as
`+` and `*` and provides functions to allow operations such as max, min, etc.

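The idea of a packed vector type can be sketched with a toy class. `ToyVec` below is hypothetical and much simpler than the real type in `vec.h`: it stores its lanes in a plain array and loops over them, whereas an actual implementation would map to SIMD registers. With `float` and eight lanes it models a 256-bit register.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical toy type: N lanes of T, processed element-wise.
template <typename T, std::size_t N>
struct ToyVec {
  std::array<T, N> lanes{};

  // Load N contiguous values from memory.
  static ToyVec loadu(const T* ptr) {
    ToyVec v;
    std::copy(ptr, ptr + N, v.lanes.begin());
    return v;
  }

  // Store all N lanes back to memory.
  void store(T* ptr) const {
    std::copy(lanes.begin(), lanes.end(), ptr);
  }

  friend ToyVec operator+(const ToyVec& a, const ToyVec& b) {
    ToyVec out;
    for (std::size_t i = 0; i < N; ++i) out.lanes[i] = a.lanes[i] + b.lanes[i];
    return out;
  }

  friend ToyVec operator*(const ToyVec& a, const ToyVec& b) {
    ToyVec out;
    for (std::size_t i = 0; i < N; ++i) out.lanes[i] = a.lanes[i] * b.lanes[i];
    return out;
  }
};

// Lane-wise maximum of two vectors.
template <typename T, std::size_t N>
ToyVec<T, N> maximum(const ToyVec<T, N>& a, const ToyVec<T, N>& b) {
  ToyVec<T, N> out;
  for (std::size_t i = 0; i < N; ++i) out.lanes[i] = std::max(a.lanes[i], b.lanes[i]);
  return out;
}
```

Kernel code written against such a type stays the same regardless of register width; only the lane count changes.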
As an example, `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces
an entire array using a given associative binary operation such as `+`.

More explicitly, calling `kernel_` with template argument `std::plus` will cause
it to sum up the entire array into a single value.

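A scalar sketch of this kind of generic reduction is shown below. `reduce_all` is a hypothetical stand-in, not the code in `ReduceOpsKernel.cpp`, and it takes the identity element as an explicit argument for simplicity.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical generic reduction: folds `data` with the binary operation Op,
// starting from `identity` (0 for std::plus, 1 for std::multiplies).
template <template <typename> class Op, typename T>
T reduce_all(const T* data, std::size_t n, T identity) {
  Op<T> op;
  T acc = identity;
  for (std::size_t i = 0; i < n; ++i) {
    acc = op(acc, data[i]);
  }
  return acc;
}
```

For instance, `reduce_all<std::plus>(xs, n, 0.0f)` sums the array, while `reduce_all<std::multiplies>(xs, n, 1.0f)` computes its product.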
`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
compiler flags it is currently compiled. This allows the programmer to write
generic code, which will be compiled under multiple compilation settings.

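A rough sketch of how a source file can branch on such macros follows. It assumes the build defines at most one of `CPU_CAPABILITY_AVX512` or `CPU_CAPABILITY_AVX2` per compilation of the file; `compiled_capability` is a hypothetical helper. It sits in an unnamed namespace because its body differs between compilations.

```cpp
#include <string>

namespace {

// Reports which capability this translation unit was compiled for.
// Assumption: the build passes e.g. -DCPU_CAPABILITY_AVX2 alongside the
// matching instruction-set flags; with neither macro defined, the
// function reports the baseline build.
std::string compiled_capability() {
#if defined(CPU_CAPABILITY_AVX512)
  return "avx512";
#elif defined(CPU_CAPABILITY_AVX2)
  return "avx2";
#else
  return "default";
#endif
}

} // namespace
```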
`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
a generic definition of `sumImplAll`. This function allows the user to reduce
over a dimension or all dimensions. The appropriate capability is chosen at
runtime using cpuinfo. If the current platform has AVX2, `sumImpl` will be set
to `sumImplAll<CPUCapability::AVX2>`.

At runtime, the following environment variables control which codepath is taken:

x64 options:

    ATEN_CPU_CAPABILITY=avx2    # Force AVX2 codepaths to be used
    ATEN_CPU_CAPABILITY=avx     # Force AVX codepaths to be used
    ATEN_CPU_CAPABILITY=default # Use oldest supported vector instruction set

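One way such an override could be read is sketched below. `Capability`, `parse_capability`, and `capability_from_env` are hypothetical names, and falling back to the detected capability for unset or unrecognized values is an assumption, not a description of what ATen does.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical capability levels matching the documented values.
enum class Capability { Default, AVX, AVX2, AVX512 };

// Maps an override string to a capability. Returns `detected` when the
// value is null or not recognized (an assumed policy).
Capability parse_capability(const char* value, Capability detected) {
  if (value == nullptr) return detected;
  const std::string s(value);
  if (s == "avx512") return Capability::AVX512;
  if (s == "avx2") return Capability::AVX2;
  if (s == "avx") return Capability::AVX;
  if (s == "default") return Capability::Default;
  return detected;
}

// Reads ATEN_CPU_CAPABILITY from the environment and applies the override.
Capability capability_from_env(Capability detected) {
  return parse_capability(std::getenv("ATEN_CPU_CAPABILITY"), detected);
}
```

Keeping the string parsing separate from the `std::getenv` call makes the mapping easy to test without touching the process environment.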