The most important things to know:

**Don't add a kernel to this folder unless you want it to be
compiled multiple times for different instruction sets.** Yes,
this folder is named `cpu`, but that doesn't mean you should put any old
CPU kernel in it. Only put CPU kernels that need to be compiled
multiple times so they can take advantage of AVX512/AVX2/SSE instructions
on processors that support them.

**Ensure that all implementations in this folder are put in an
anonymous namespace.** The files in this folder are compiled multiple
times with different headers. It's important that these functions have
internal linkage so that kernels for different architectures don't get
combined during linking. It's sufficient to label free functions `static`,
but class methods must be placed in an anonymous namespace to have internal
linkage (since `static` means something different in the context of classes).

**The basic recipe is to define your kernel, and then register
it using DECLARE/REGISTER DISPATCH.** Writing a kernel requires
four steps:

1. Declare your dispatch in a header file using
   `DECLARE_DISPATCH(fn_type, fnNameImpl);`
   where `fn_type` is the function pointer type of the kernel (e.g.,
   defined as `using fn_type = void(*)(Tensor&, const Tensor&)`)
   and `fnNameImpl` is the name of your dispatch registry.
   (It doesn't really matter where you put this declaration.)

2. Define your dispatch in a C++ file that is NOT in the cpu
   directory (the dispatch must be defined exactly once) using
   `DEFINE_DISPATCH(fnNameImpl)` (matching the name of your declaration).
   Include the header file that declares the dispatch in this C++
   file. Conventionally, we define the dispatch in the same file
   where we define our native function.

3. Define a native function which calls into the dispatch using
   `fnNameImpl(kCPU, arguments...)`, where the arguments are
   the arguments according to the `fn_type` you defined in the
   declaration.

4. Write your actual kernel (e.g., `your_kernel`) in the
   cpu directory, and register it with the dispatch using
   `REGISTER_DISPATCH(fnNameImpl, &your_kernel)` if it does not perform
   as well with AVX512 as it does with AVX2. Otherwise, if it performs
   well with AVX512, register it with
   `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`.
   Compute-intensive kernels tend to perform better with AVX512 than with AVX2.
   To compare the AVX2 and AVX512 variants of a kernel, register it with
   `ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel)`, build from source,
   and then benchmark the kernel twice with a benchmarking script, once with the
   environment variable `ATEN_CPU_CAPABILITY=avx2` and once with
   `ATEN_CPU_CAPABILITY=avx512`. tcmalloc/jemalloc can be preloaded for minimal
   run-to-run variation.

There are plenty of existing examples; look at them for more details.
A minimal sketch of these four steps is shown at the end of this section.

----

TODO: Clarify and add more documentation all around.

All of the `*.cpp` files in this folder will be compiled under all compiler
flags specified by `CPU_CAPABILITY_FLAGS` in `aten/src/ATen/CMakeLists.txt`.

The purpose of this is to allow compilation with various compiler
flags that enable features such as AVX2 or AVX512 instructions, while using
runtime dispatch to make sure only valid instructions are used on any
given platform.
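To make the recipe above concrete, here is a minimal sketch of steps 1–4
spread across three illustrative files. The operator name `my_op`, the stub
name `my_op_stub`, the kernel name `my_op_kernel`, and the file names are
hypothetical, and namespace qualifications (e.g. `at::native`) are omitted
for brevity; only the `DECLARE_DISPATCH`/`DEFINE_DISPATCH`/`REGISTER_DISPATCH`
macros and the `fnNameImpl(kCPU, ...)` calling convention come from the
recipe itself.

```cpp
// --- MyOp.h (any header outside native/cpu): step 1 ---
#include <ATen/native/DispatchStub.h>

using my_op_fn = void (*)(at::Tensor&, const at::Tensor&);
DECLARE_DISPATCH(my_op_fn, my_op_stub);

// --- MyOp.cpp (NOT in the cpu directory): steps 2 and 3 ---
// The dispatch is defined exactly once, conventionally next to the
// native function that calls it.
DEFINE_DISPATCH(my_op_stub);

at::Tensor my_op(const at::Tensor& self) {
  at::Tensor result = at::empty_like(self);
  my_op_stub(kCPU, result, self);  // runtime dispatch picks the best kernel
  return result;
}

// --- native/cpu/MyOpKernel.cpp: step 4 ---
namespace {
// Compiled once per entry in CPU_CAPABILITY_FLAGS; the anonymous
// namespace gives the kernel internal linkage.
void my_op_kernel(at::Tensor& result, const at::Tensor& self) {
  // ... vectorized implementation goes here ...
}
}  // namespace

REGISTER_DISPATCH(my_op_stub, &my_op_kernel);
// If the kernel also performs well with AVX512, use instead:
// ALSO_REGISTER_AVX512_DISPATCH(my_op_stub, &my_op_kernel);
```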
`vec.h` provides a generic implementation of a `vec` type that allows
the programmer to write code that packs various primitives (such as floats)
into 256-bit and 512-bit registers. `vec` defines various operators such as
`+` and `*` and provides functions for operations such as max, min, etc.
(A short sketch of this style of code appears at the end of this document.)

As an example, `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces
an entire array using a given associative binary operation such as `+`.

More explicitly, calling `kernel_` with template argument `std::plus` will cause
it to sum up the entire array into a single value.

`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
compiler flags it is currently compiled. This allows the programmer to write
generic code, which will be compiled under multiple compilation settings.

`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
a generic definition of `sumImplAll`. This function allows the user to reduce
over a dimension or all dimensions. The appropriate capability is chosen at
runtime using cpuinfo. If the current platform has AVX2, `sumImpl` will be set
to `sumImplAll<CPUCapability::AVX2>`.

At runtime, the following environment variables control which codepath is taken:

x64 options:
ATEN_CPU_CAPABILITY=avx2    # Force AVX2 codepaths to be used
ATEN_CPU_CAPABILITY=avx     # Force AVX codepaths to be used
ATEN_CPU_CAPABILITY=default # Use the oldest supported vector instruction set
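As an illustration of the kind of code `vec.h` enables, here is a minimal
sketch of a vectorized sum over a contiguous float array. It assumes the
`at::vec::Vectorized<float>` type exposed through `ATen/cpu/vec/vec.h`; the
exact header path and type name have changed across versions, and this is
not the actual `ReduceOpsKernel.cpp` implementation.

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

namespace {  // internal linkage, as required for files in this folder

// Sum a contiguous float array, Vec::size() lanes at a time
// (8 floats with AVX2, 16 with AVX512).
float sum_all(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);
  }
  // Horizontal reduction of the vector accumulator.
  float buf[Vec::size()];
  acc.store(buf);
  float result = 0.0f;
  for (int j = 0; j < Vec::size(); ++j) {
    result += buf[j];
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i) {
    result += data[i];
  }
  return result;
}

}  // namespace
```

Real kernels in this folder typically also handle multiple dtypes and strided
data, but they are built on this same `Vectorized` abstraction, and the whole
file is recompiled once per CPU capability so the loop above maps onto the
widest instructions available.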