# Efficient Fuzzing Guide

This guide applies to fuzzers created with [libfuzzer], not [FuzzTests]. None of
this advice is necessary for FuzzTests.

Once you have a fuzz target running, you can analyze and tweak it to improve its
efficiency. This document describes techniques to minimize fuzzing time and
maximize your results.

*** note
**Note:** If you haven’t created your first fuzz target yet, see the [Getting
Started Guide].
***

The most direct way to gauge the effectiveness of your fuzz target is to collect
metrics. You can get them manually, or take them from a [ClusterFuzz status]
page after your fuzz target is checked into the Chromium repository.

[TOC]

## Key metrics of a fuzz target

### Execution speed

A fuzzing engine such as libFuzzer typically explores a large search space by
performing randomized mutations, so it needs to run as fast as possible to find
interesting code paths.

Fuzz target speed is calculated in executions per second (`exec/s`). It is
printed while a fuzz target is running:

```
#11002  NEW    cov: 1337 ft: 10934 corp: 707/409Kb lim: 1098 exec/s: 5333 rss: 27Mb L: 186/1098
```

You should aim for at least 1,000 exec/s from your fuzz target locally before
submitting it to the Chromium repository. If you’re under 1,000, consider the
following improvements:

* [Simplifying initialization/cleanup](#Simplifying-initialization-cleanup)
* [Minimizing memory usage](#Minimizing-memory-usage)

#### Simplifying initialization/cleanup

If your `LLVMFuzzerTestOneInput` function is too complex, it can decrease the
fuzzer’s execution speed. It can also cause the fuzzer to target specific
use-cases or fail to account for unexpected scenarios.

Instead of performing setup and teardown on each input, use static
initialization and shared resources. See the [startup initialization] section of
libFuzzer’s documentation for an example.

*** note
**Note:** You can skip freeing static resources. However, all other resources
allocated within the `LLVMFuzzerTestOneInput` function should be de-allocated,
since the function gets called millions of times during a fuzzing session. If
you don’t, you’ll often run out of memory and reduce overall fuzzing efficiency.
***
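
As a minimal sketch of this pattern, the fuzz target below builds its expensive
state once in a function-local static and keeps per-input state in a local
object that is released automatically. The `Environment` type, its lookup table,
and the `construction_count` counter are hypothetical stand-ins for whatever
your target needs to set up; they are not part of libFuzzer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for expensive, input-independent setup (for example,
// building a lookup table or loading configuration).
struct Environment {
  Environment() : table(256) {
    for (size_t i = 0; i < table.size(); ++i) {
      table[i] = static_cast<uint8_t>(i ^ 0x5A);
    }
    ++construction_count;
  }

  std::vector<uint8_t> table;
  static int construction_count;  // Lets you confirm that setup ran only once.
};
int Environment::construction_count = 0;

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // Function-local static: constructed on the first call and intentionally
  // never freed, so later calls skip the setup cost.
  static Environment* env = new Environment();

  // Per-input state lives in a local object whose destructor releases it on
  // every return path, so nothing accumulates across calls.
  std::vector<uint8_t> decoded;
  decoded.reserve(size);
  for (size_t i = 0; i < size; ++i) {
    decoded.push_back(env->table[data[i]]);
  }

  // A failed invariant aborts the process, which libFuzzer reports as a crash.
  assert(decoded.size() == size);
  return 0;
}
```

Calling the function repeatedly leaves `Environment::construction_count` at 1,
which is a quick way to check that the setup really is shared across inputs.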

#### Minimizing memory usage

Avoid allocation of dynamic memory wherever possible. Memory instrumentation
works faster for stack-based and static objects than for heap-allocated ones.

*** note
**Note:** It’s always a good idea to try different variants for your fuzz target
locally, then submit only the fastest implementation to the Chromium repository.
***
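
One way this can look in practice, assuming the code under test needs a mutable
copy of the input, is to copy into a fixed-size stack buffer rather than a
heap-allocated container. `NormalizeHeader` and `kMaxHeaderSize` are
hypothetical names chosen for this sketch, and truncating the input to a fixed
bound is a trade-off that only makes sense if your target has a natural size
limit.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical code under test: upper-cases ASCII letters in place and returns
// how many bytes it changed.
size_t NormalizeHeader(uint8_t* header, size_t size) {
  size_t changed = 0;
  for (size_t i = 0; i < size; ++i) {
    if (header[i] >= 'a' && header[i] <= 'z') {
      header[i] = static_cast<uint8_t>(header[i] - 'a' + 'A');
      ++changed;
    }
  }
  return changed;
}

constexpr size_t kMaxHeaderSize = 64;  // Assumed upper bound for this sketch.

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // A fixed-size stack buffer avoids a heap allocation (e.g., std::vector or
  // new[]) on every execution. Inputs longer than the bound are truncated.
  std::array<uint8_t, kMaxHeaderSize> header{};
  const size_t n = std::min(size, header.size());
  assert(n <= header.size());
  if (n > 0) {
    std::memcpy(header.data(), data, n);
  }
  NormalizeHeader(header.data(), n);
  return 0;
}
```

Whether this is actually faster for your target is worth measuring locally by
comparing `exec/s` between the two variants.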

### Code coverage

You can check the percentage of code covered by your fuzz target to gauge
fuzzing effectiveness:

* Review aggregated Chrome coverage from recent runs by checking the [fuzzing
  coverage] report. This report can provide insight on how to improve code
  coverage.
* Generate a source-level coverage report for your fuzzer by running the
  [coverage script] stored in the Chromium repository. The script provides
  detailed instructions and a usage example.

For the `out/coverage` target in the coverage script, make sure to add all of
the gn args you needed to build the `out/libfuzzer` target; this could include
args like `target_os=chromeos` and `is_asan=true` depending on the [gn config]
you chose.

*** note
**Note:** The code coverage of a fuzz target depends heavily on the corpus. A
well-chosen corpus will produce much greater code coverage. On the other hand,
a coverage report generated by a fuzz target without a corpus won't cover much
code. If you don’t have a corpus to use, you can download the [corpus from
ClusterFuzz]. For more information on the corpus, see
[Corpus size](#Corpus-size).
***

### Corpus size

A guided fuzzing engine such as libFuzzer considers an input (a.k.a. testcase
or corpus unit) *interesting* if the input results in new code coverage (i.e.,
if the fuzzer reaches code that has not been reached before). The set of all
interesting inputs is called the *corpus*. A corpus is shared across fuzzer runs
and grows over time.
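
The rule for what counts as interesting can be illustrated with a toy model.
This is a simplified sketch, not libFuzzer’s implementation:
`RunAndCollectCoverage`, the `Corpus` struct, and the integer edge IDs are all
hypothetical, and a real engine tracks richer coverage signals than a set of
edges.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical instrumented target: returns the IDs of the code edges that
// the given input reached.
std::set<int> RunAndCollectCoverage(const std::string& input) {
  std::set<int> edges = {0};  // Entry edge, reached by every input.
  if (!input.empty() && input[0] == 'F') {
    edges.insert(1);
    if (input.size() > 1 && input[1] == 'U') {
      edges.insert(2);
    }
  }
  return edges;
}

struct Corpus {
  std::vector<std::string> inputs;  // Interesting inputs kept so far.
  std::set<int> covered;            // Union of edges reached by those inputs.

  // Keeps `input` only if it reaches at least one edge not covered before.
  bool AddIfInteresting(const std::string& input) {
    bool is_new = false;
    for (int edge : RunAndCollectCoverage(input)) {
      if (covered.insert(edge).second) {
        is_new = true;
      }
    }
    if (is_new) {
      inputs.push_back(input);
    }
    // Every kept input contributed at least one previously unseen edge.
    assert(inputs.size() <= covered.size());
    return is_new;
  }
};
```

In this model, `"A"` is kept because it is the first input to reach the entry
edge, `"B"` is discarded because it adds nothing new, and `"FX"` and `"FU"` are
kept because each reaches one more edge than the corpus had covered.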

If a fuzz target stops discovering new interesting inputs after running for a
while, it typically indicates that the fuzz target is hitting a code barrier
(also called a *coverage plateau*). The corpus for a reasonably complex target
should contain hundreds (if not thousands) of inputs.

If a fuzz target reaches a coverage plateau with a small corpus, the common
causes are checksums and magic numbers in the input format. It’s also possible
that much of the code is simply unreachable from your fuzz target. The easiest
way to diagnose the problem is to generate and analyze a
[coverage report](#Code-coverage). Then, to fix the issue, try the following:

* Change the code (e.g., disable CRC checks while fuzzing) with a
  [custom build](#Custom-build).
* Prepare or improve the [seed corpus](#Seed-corpus).
* Prepare or improve the [fuzzer dictionary](#Fuzzer-dictionary).

## Ways to improve a fuzz target

### Seed corpus

You can give your fuzz target a starting point by creating a set of valid and
interesting inputs called a *seed corpus*. If you don’t provide a seed corpus,
the fuzzing engine has to guess inputs from scratch, which can take time
(depending on the size of the inputs and the complexity of the target format).
In many cases, providing a seed corpus can increase code coverage by an order of
magnitude.

Seed corpuses work especially well for strictly defined file formats and data
transmission protocols:

* For file format parsers, add valid files from your test suite.
* For protocol parsers, add valid raw streams from a test suite into separate
  files.
* For graphics libraries, add a variety of small PNG/JPG/GIF files.

#### Using a corpus locally

If you’re running a fuzz target locally, you can easily designate a corpus by
passing a directory as an argument:

```
./out/libfuzzer/my_fuzzer ~/tmp/my_fuzzer_corpus
```

The fuzzer stores all the interesting inputs it finds in the directory.

#### Creating a Chromium repository seed corpus

When running fuzz targets at scale, ClusterFuzz looks for a seed corpus defined
in the Chromium source repository. You can define one in your `BUILD.gn` file by
adding a `seed_corpus` attribute to your `fuzzer_test` target definition:

```
fuzzer_test("my_fuzzer") {
  ...
  seed_corpus = "test/fuzz/testcases"
  ...
}
```

If you want to specify multiple seed corpus directories, use the `seed_corpuses`
attribute instead:

```
fuzzer_test("my_fuzzer") {
  ...
  seed_corpuses = [ "test/fuzz/testcases", "test/unittest/data" ]
  ...
}
```

All files found in these directories and their subdirectories are stored in a
`<my_fuzzer>_seed_corpus.zip` output archive.

#### Uploading corpus files to GCS

If you can't store your seed corpus in the Chromium repository (e.g., it’s too
large, can’t be open-sourced, etc.), you can upload the corpus to the Google
Cloud Storage (GCS) bucket used by ClusterFuzz.

1) Open the [Corpus GCS Bucket] in your browser.
2) Search for the directory named `<my_fuzzer>`. If the directory does not
   exist, create it.
3) In the `<my_fuzzer>` directory, upload your corpus files.

*** note
**Note:** If you upload your corpus to GCS, you don’t need to add the
`seed_corpus` attribute to your `fuzzer_test` target definition. However, adding
a seed corpus to the Chromium repository is the preferred approach.
***

You can do the same thing by using the [gsutil] command line tool:

```bash
gsutil -m rsync <path_to_corpus> gs://clusterfuzz-corpus/libfuzzer/<my_fuzzer>
```

*** note
**Note:** To write to this bucket using `gsutil`, you must be logged into your
@google.com account (@chromium.org will not work). You can use the `gcloud auth
login` command to log into your account in `gsutil` if you installed `gsutil`
through `gcloud`.
***

#### Minimizing a seed corpus

Your seed corpus is synced to all fuzzing bots for every iteration, so it's
important to minimize it to a small set of interesting inputs before uploading.
Keeping the seed corpus small improves fuzzing efficiency and prevents our bots
from running out of disk space.

You can minimize your seed corpus by using libFuzzer’s `-merge=1` option:

```bash
# Create an empty directory.
mkdir seed_corpus_minimized

# Run the fuzzer with -merge=1 flag.
./my_fuzzer -merge=1 ./seed_corpus_minimized ./seed_corpus
```

After running the command, the `seed_corpus_minimized` directory will contain a
minimized corpus that gives the same code coverage as your initial `seed_corpus`
directory.

### Fuzzer dictionary

You can help your fuzzer increase its coverage by providing a set of common
words or values that you expect to find in the input. Such a dictionary works
especially well for certain use-cases (e.g., fuzzing file format decoders or
text-based protocols like XML).

To add a fuzzer dictionary:

1) Create a flat ASCII text file that lists one input token per line in the
   format `name="value"`. The value must appear in quotes, with hex escaping
   (`\xNN`) applied to all non-printable, high-bit, or otherwise problematic
   characters (the `\\` and `\"` shorthands are recognized, too). This syntax is
   similar to the one used by the [AFL] fuzzing engine (`-x` option).

   *** note
   **Note:** `name` can be omitted, but it is a convenient way to document the
   meaning of each token.
   ***

   Here’s an example dictionary:

   ```
   # Lines starting with '#' and empty lines are ignored.

   # Adds "blah" word (w/o quotes) to the dictionary.
   kw1="blah"
   # Use \\ for backslash and \" for quotes.
   kw2="\"ac\\dc\""
   # Use \xAB for hex values.
   kw3="\xF7\xF8"
   # Key name before '=' can be omitted:
   "foo\x0Abar"
   ```

2) Test your dictionary by running your fuzz target locally:

   ```bash
   ./out/libfuzzer/my_fuzzer -dict=<path_to_dict> <path_to_corpus>
   ```

   If the dictionary is effective, you should see `NEW` units discovered in the
   output.

3) Add the dictionary file in the same directory as your fuzz target, then add
   the `dict` attribute to the `fuzzer_test` definition in your `BUILD.gn` file:

   ```
   fuzzer_test("my_fuzzer") {
     ...
     dict = "my_fuzzer.dict"
   }
   ```

   After the dictionary is submitted to the Chromium repository, ClusterFuzz
   uses it automatically once it picks up a new revision build.
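
To make the escaping rules from step 1 concrete, here is a small decoder for a
single dictionary line. It is an illustrative sketch only: `DecodeDictLine` and
`HexValue` are hypothetical helpers, not libFuzzer’s parser, and the sketch
handles only the `\\`, `\"`, and `\xNN` escapes described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the value of one hex digit, or -1 if `c` is not a hex digit.
int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical helper: decodes the quoted value of one dictionary line into
// `token`. Returns false for comments, blank lines, and lines that this
// sketch cannot parse.
bool DecodeDictLine(const std::string& line, std::string* token) {
  token->clear();
  const size_t first = line.find_first_not_of(" \t");
  if (first == std::string::npos || line[first] == '#') {
    return false;  // Blank line or comment.
  }
  // The optional name precedes the first quote; the value ends at the last.
  const size_t open = line.find('"', first);
  const size_t close = line.rfind('"');
  if (open == std::string::npos || close <= open) {
    return false;  // No quoted value.
  }

  for (size_t i = open + 1; i < close; ++i) {
    const char c = line[i];
    if (c != '\\') {
      token->push_back(c);
      continue;
    }
    if (i + 1 >= close) {
      return false;  // Dangling backslash.
    }
    const char next = line[i + 1];
    if (next == '\\' || next == '"') {
      token->push_back(next);
      i += 1;
    } else if (next == 'x' && i + 3 < close) {
      const int hi = HexValue(line[i + 2]);
      const int lo = HexValue(line[i + 3]);
      if (hi < 0 || lo < 0) {
        return false;  // Malformed hex escape.
      }
      token->push_back(static_cast<char>(hi * 16 + lo));
      i += 3;
    } else {
      return false;  // Escape not handled by this sketch.
    }
  }
  // Decoding never produces more bytes than the quoted text contained.
  assert(token->size() <= close - open - 1);
  return true;
}
```

Applied to the example dictionary, `kw1="blah"` decodes to the four bytes
`blah`, and `"foo\x0Abar"` decodes to `foo`, a newline byte, and `bar`.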

### Custom build

If you need to change the code being tested by your fuzz target, you can wrap
the change in an `#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION` block in your
target code.

*** note
**Note:** Patching target code is not the preferred way of improving the
corresponding fuzz target, but in some cases it might be the only way to do it
(e.g., when there is no intended API to disable checksum verification, or when
the target code uses a random generator that affects the reproducibility of
crashes).
***
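
As an example of the checksum case, the sketch below skips verification only
when the macro is defined, so production builds still enforce it. The packet
layout, `ComputeChecksum`, and `ParsePacket` are hypothetical and exist only to
show where the `#ifdef` goes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical checksum: a wrapping byte sum standing in for a real CRC.
uint8_t ComputeChecksum(const uint8_t* data, size_t size) {
  uint8_t sum = 0;
  for (size_t i = 0; i < size; ++i) {
    sum = static_cast<uint8_t>(sum + data[i]);
  }
  return sum;
}

// Hypothetical packet layout: payload bytes followed by a one-byte checksum.
// Returns true if the packet is accepted.
bool ParsePacket(const uint8_t* data, size_t size) {
  if (size < 2) {
    return false;
  }
  assert(data != nullptr);
  const size_t payload_size = size - 1;

#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
  // Fuzzing build: accept any checksum so that mutated inputs reach the
  // parsing logic below instead of being rejected here.
  const bool checksum_ok = true;
#else
  const bool checksum_ok =
      ComputeChecksum(data, payload_size) == data[payload_size];
#endif
  if (!checksum_ok) {
    return false;
  }

  // Real payload parsing would continue here. This sketch only checks an
  // assumed one-byte message type.
  return data[0] == 0x01;
}
```

Keeping the guarded region as small as possible limits how far the fuzzing
build diverges from the code that actually ships.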

[AFL]: http://lcamtuf.coredump.cx/afl/
[ClusterFuzz status]: libFuzzer_integration.md#Status-Links
[Corpus GCS Bucket]: https://console.cloud.google.com/storage/clusterfuzz-corpus/libfuzzer
[Getting Started Guide]: getting_started.md
[gn config]: getting_started.md#running-the-fuzz-target
[corpus from ClusterFuzz]: libFuzzer_integration.md#Corpus
[coverage script]: https://cs.chromium.org/chromium/src/tools/code_coverage/coverage.py
[fuzzing coverage]: https://analysis.chromium.org/coverage/p/chromium?platform=fuzz
[gsutil]: https://cloud.google.com/storage/docs/gsutil
[startup initialization]: https://llvm.org/docs/LibFuzzer.html#startup-initialization
[libfuzzer]: getting_started_with_libfuzzer.md
[fuzztests]: getting_started.md