# gemmlowp: a small self-contained low-precision GEMM library

[Build status](http://travis-ci.org/google/gemmlowp)

This is not a full linear algebra library, only a GEMM library: it only does
general matrix multiplication ("GEMM").

The meaning of "low precision" is detailed in this document:
[doc/low-precision.md](doc/low-precision.md)

Some of the general design is explained in [doc/design.md](doc/design.md).

**Warning:** This library runs very slowly if compiled incorrectly; see below.

## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.

## Mailing list

gemmlowp-related discussion, about either development or usage, is welcome on
this Google Group (mailing list / forum):

https://groups.google.com/forum/#!forum/gemmlowp

## Portability, target platforms/architectures

gemmlowp should be portable to any platform with some C++11 and POSIX support,
and it has optional optimized code paths for specific architectures.

Required:

*   C++11 (a small conservative subset of it)

Required for some features:

*   Some POSIX interfaces:
    *   pthreads (for multi-threaded operation and for profiling).
    *   sysconf (for multi-threaded operation to detect number of cores; may be
        bypassed).

Optional:

*   Architecture-specific code paths use intrinsics or inline assembly. See
    "Architecture-specific optimized code paths" below.

## Architecture-specific optimized code paths

We have some optimized code paths for specific instruction sets. Some are
written in inline assembly, some are written in C++ using intrinsics. Both GCC
and Clang are supported.

Current optimized code paths:

*   ARM with NEON (both 32-bit and 64-bit).
*   Intel x86 with SSE 4.1 (both 32-bit and 64-bit).

When building for x86, it's very important to pass `-msse4.1` to the compiler;
otherwise gemmlowp will use slow reference code.
Bazel users can compile by
running `bazel build --copt=-msse4.1 //gemmlowp:all`. The compiled binary should
work on all Intel CPUs since 2008 (including low power microarchitectures) as
well as AMD CPUs since 2011.

Please note that when compiling binaries that don't need to be distributed, it's
generally a better idea to pass `-march=native` to the compiler. That flag
implies `-msse4.1`, along with others that might be helpful. This of course
assumes the host machine supports those instructions. Bazel users should prefer
to run `bazel build --config=opt //gemmlowp:all` instead.

Details of what it takes to make an efficient port of gemmlowp, namely writing a
suitable GEMM kernel and accompanying packing code, are explained in this file:
[doc/kernel.md](doc/kernel.md).

## Public interfaces

### The gemmlowp public interface

gemmlowp's main public interface is in the `public/` subdirectory.

This is a header-only library, so there is nothing to link to.

Usage documentation, and comments on the deprecation status of each public entry
point, may be found in [doc/public.md](doc/public.md).

A full, self-contained usage example, showing how to quantize float matrices and
perform a quantized matrix multiplication approximating a float matrix
multiplication, is given in
[doc/quantization_example.cc](doc/quantization_example.cc).

### Old EightBitIntGemm legacy deprecated interface

The `eight_bit_int_gemm/` subdirectory contains an alternate interface that
should be considered purely legacy and deprecated; it will be removed at some
point in the future.

## Building

### Building by manually invoking your compiler

Because gemmlowp is so simple, working with it involves only single-command-line
compiler invocations.
Therefore we expect that most people working with gemmlowp
will either manually invoke their compiler, or write their own rules for their
own preferred build system.

Keep in mind (see the previous section) that gemmlowp itself is a header-only
library, so there is nothing to build.

For an Android gemmlowp development workflow, the `scripts/` directory contains
a script to build and run a program on an Android device:

```
scripts/test-android.sh
```

### Building using Bazel

That being said, we also maintain a Bazel BUILD system as part of gemmlowp. Its
usage is not mandatory at all and is only one possible way that gemmlowp
libraries and tests may be built. If you are interested, Bazel's home page is
http://bazel.build/. You can get started with using Bazel to build gemmlowp
targets by first creating an empty WORKSPACE file in a parent directory, for
instance:

```
$ cd gemmlowp/..  # change to parent directory containing gemmlowp/
$ touch WORKSPACE # declare that to be our workspace root
$ bazel build gemmlowp:all
```

## Testing

### Testing by manually building and running tests

The `test/` directory contains unit tests.
The primary unit test is

```
test/test.cc
```

Since it also covers the EightBitIntGemm interface, it needs to be linked
against

```
eight_bit_int_gemm/eight_bit_int_gemm.cc
```

It also uses realistic data captured from a neural network run in

```
test/test_data.cc
```

Thus you'll want to pass the following list of source files to your
compiler/linker:

```
test/test.cc
eight_bit_int_gemm/eight_bit_int_gemm.cc
test/test_data.cc
```

The `scripts/` directory contains a script to build and run a program on an
Android device:

```
scripts/test-android.sh
```

It expects the `CXX` environment variable to point to an Android toolchain's C++
compiler, and expects source files (and optionally, cflags) as command-line
parameters. To build and run the above-mentioned main unit test, first set
`CXX`, e.g.:

```
$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++
```

Then run:

```
$ ./scripts/test-android.sh \
test/test.cc \
eight_bit_int_gemm/eight_bit_int_gemm.cc \
test/test_data.cc
```

### Testing using Bazel

Alternatively, you can use Bazel to build and run tests. See the Bazel
instructions in the above section on building.
Once your Bazel workspace is set
up, you can for instance do:

```
$ bazel test gemmlowp:all
```

## Troubleshooting Compilation

If you're having trouble finding the compiler, follow these instructions to
build a standalone toolchain:
https://developer.android.com/ndk/guides/standalone_toolchain.html

Here's an example of setting up Clang 3.5:

```
$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu
$ $NDK/build/tools/make-standalone-toolchain.sh \
--toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \
--install-dir=$INSTALL_DIR
$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \
--sysroot=$INSTALL_DIR/sysroot"
```

Some compilers (e.g. the default clang++ in the same bin directory) don't
support NEON assembly. The benchmark build process will issue a warning if
support isn't detected, and you should make sure you're using a compiler like
arm-linux-androideabi-g++ that does include NEON.

## Benchmarking

The main benchmark is

```
test/benchmark.cc
```

It doesn't need to be linked to any other source file. We recommend building
with assertions disabled (`-DNDEBUG`).

For example, the benchmark can be built and run on an Android device by doing:

```
$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG
```

If `GEMMLOWP_TEST_PROFILE` is defined then the benchmark will be built with
profiling instrumentation (which makes it slower) and will dump profiles. See
next section on profiling.

## Profiling

The `profiling/` subdirectory offers a very simple, naive, inaccurate,
non-interrupting sampling profiler that only requires pthreads (no signals).

It relies on source code being instrumented with pseudo-stack labels. See
`profiling/instrumentation.h`. A full example of using this profiler is given in
the top comment of `profiling/profiler.h`.
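
To illustrate the general idea of pseudo-stack labels, here is a minimal,
self-contained C++11 sketch. It is not gemmlowp's profiler: the names
`LabelStack`, `ScopedLabel` and `SampleCounts` are hypothetical and do not
appear in `profiling/`. The sketch assumes that instrumented code pushes a label
on scope entry and pops it on scope exit, and that a sampler merely reads the
current labels instead of interrupting the instrumented thread. For simplicity,
sampling is triggered by hand here; a real profiler would presumably call
something like `SampleOnce` periodically from a separate thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical illustration only; none of these names are gemmlowp APIs.

// Holds the labels of the scopes that the instrumented code is currently in.
class LabelStack {
 public:
  void Push(const char* label) {
    std::lock_guard<std::mutex> lock(mutex_);
    labels_.push_back(label);
  }
  void Pop() {
    std::lock_guard<std::mutex> lock(mutex_);
    labels_.pop_back();
  }
  // Returns the current labels joined as "outer > inner".
  std::string Snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (labels_.empty()) return "(no label)";
    std::string joined;
    for (std::size_t i = 0; i < labels_.size(); ++i) {
      if (i > 0) joined += " > ";
      joined += labels_[i];
    }
    return joined;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<const char*> labels_;
};

// RAII helper: pushes a label on construction, pops it on destruction.
class ScopedLabel {
 public:
  ScopedLabel(LabelStack* stack, const char* label) : stack_(stack) {
    stack_->Push(label);
  }
  ~ScopedLabel() { stack_->Pop(); }
  ScopedLabel(const ScopedLabel&) = delete;
  ScopedLabel& operator=(const ScopedLabel&) = delete;

 private:
  LabelStack* stack_;
};

// Counts how often each pseudo-stack was observed by the sampler.
class SampleCounts {
 public:
  void SampleOnce(const LabelStack& stack) { ++counts_[stack.Snapshot()]; }
  std::size_t CountFor(const std::string& key) const {
    auto it = counts_.find(key);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::map<std::string, std::size_t> counts_;
};
```

In this sketch, code being profiled would declare a `ScopedLabel` at the top of
each scope of interest, and the share of samples recorded for a given
pseudo-stack approximates the share of time spent there. Because the sampler
only reads labels, no signals are needed, but the result is only as precise as
the sampling rate and the placement of the labels.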

## Contributing

Contribution-related discussion is always welcome on the gemmlowp mailing list
(see above).

We try to keep a current list of TODO items in the `todo/` directory.
Prospective contributors are welcome to pick one to work on, and communicate
about it on the gemmlowp mailing list.

Details of the contributing process, including legalese, are in CONTRIBUTING.

## Performance goals

Our performance goals differ from typical GEMM performance goals in the
following ways:

1.  We care not only about speed, but also about minimizing power usage. We
    specifically care about charge usage in mobile/embedded devices. This
    implies that we care doubly about minimizing memory bandwidth usage: we care
    about it, like any GEMM, because of the impact on speed, and we also care
    about it because it is a key factor of power usage.

2.  Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000).
    We do care about large sizes, but we also care specifically about the
    typically smaller matrix sizes encountered in various mobile applications.
    This means that we have to optimize for all sizes, not just for large enough
    sizes.