Item 30: Write more than unit tests

"All companies have test environments.

The lucky ones have production environments separate from the test environment." – @FearlessSon

Like most other modern languages, Rust includes features that make it easy to write tests that live alongside your code, and which give confidence that the code is working correctly.

This isn't the place to expound on the importance of tests; suffice it to say that if code isn't tested, it probably doesn't work the way you think it does. So this Item assumes that you're already signed up to write tests for your code.

Unit tests and integration tests, described in the next two sections, are the key forms of test. However, the Rust toolchain and extensions to it allow for various other types of test; this Item describes their distinct logistics and rationales.

Unit Tests

The most common form of test for Rust code is a unit test, which might look something like:

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_nat_subtract() {
        assert_eq!(nat_subtract(4, 3).unwrap(), 1);
        assert_eq!(nat_subtract(4, 5), None);
    }

    #[should_panic]
    #[test]
    fn test_something_that_panics() {
        nat_subtract_unchecked(4, 5);
    }
}

Some aspects of this example will appear in every unit test:

  • a collection of unit test functions, which are…
  • marked with the #[test] attribute, and included within…
  • a #[cfg(test)] attribute, so the code only gets built in test configurations.

Other aspects of this example illustrate things that are optional, and may only be relevant for particular tests:

  • The test code here is held in a separate module, conventionally called tests or test. This module may be inline (as here), or held in a separate tests.rs file.
  • The test module may have a wildcard use super::* to pull in everything from the parent module under test. This makes it more convenient to add tests (and is an exception to the general advice of Item 23 to avoid wildcard imports).
  • A unit test has the ability to use anything from the parent module, whether it is pub or not. This allows for "whitebox" testing of the code, where the unit tests exercise internal features that aren't visible to normal users.
  • The test code makes use of unwrap() for its expected results; the advice of Item 18 isn't really relevant for test-only code, where panic! is used to signal a failing test. Similarly, the test code also checks expected results with assert_eq!, which will panic on failure.
  • The code under test includes a function that panics on some kinds of invalid input, and the tests exercise that in a test that's marked with the #[should_panic] attribute. This might be an internal function that normally expects the rest of the code to respect its invariants and preconditions, or it might be a public function that has some reason to ignore the advice of Item 18. (Such a function should have a "Panics" section in its doc comment, as described in Item 27.) A sketch of what the functions under test might look like follows this list.
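
For reference, the two functions exercised by these tests might look something like the following (a sketch only; hypothetical implementations, since the real ones aren't shown here):

/// Subtract `j` from `i`, returning `None` if the result would underflow.
pub fn nat_subtract(i: u64, j: u64) -> Option<u64> {
    i.checked_sub(j)
}

/// Subtract `j` from `i`.
///
/// # Panics
///
/// Panics if `j > i`.
pub fn nat_subtract_unchecked(i: u64, j: u64) -> u64 {
    i.checked_sub(j).expect("underflow in subtraction")
}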

Item 27 suggests not documenting things that are already expressed by the type system; similarly, there's no need to test things that are guaranteed by the type system. If your enum types start holding values that aren't in the list of allowed variants, you've got bigger problems than a failing unit test!

However, if your code relies on specific functionality from your dependencies, it can be helpful to include basic tests of that functionality. The aim here is not to repeat testing that's already done by the dependency itself, but instead to have an early warning system that indicates whether it's safe to include a new version of that dependency in practice – separately from whether the semantic version number (Item 21) indicates that the new version is safe in theory.
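
For example, a crate that relies on the regex crate's support for named capture groups (a hypothetical dependency, used here purely for illustration) might pin that assumption down with a small smoke test:

#[cfg(test)]
mod dependency_tests {
    // Not a re-run of regex's own test suite; just a check that the one
    // feature this crate relies on still behaves as expected after an upgrade.
    #[test]
    fn regex_supports_named_captures() {
        let re = regex::Regex::new(r"(?P<num>[0-9]+)").unwrap();
        let caps = re.captures("item 42").unwrap();
        assert_eq!(&caps["num"], "42");
    }
}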

Integration Tests

The other common form of test included with a Rust project is the integration test, held under tests/. Each file in that directory is run as a separate test program that executes all of the functions marked with #[test].

Integration tests do not have access to crate internals, and so act as black-box tests that can only exercise the public API of the crate.
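
A minimal sketch of such a test, assuming the crate is called mycrate and publicly exports the nat_subtract function from earlier:

// tests/subtract.rs (hypothetical filename)
use mycrate::nat_subtract; // only the public API is visible here

#[test]
fn test_subtract_via_public_api() {
    assert_eq!(nat_subtract(10, 4), Some(6));
    assert_eq!(nat_subtract(4, 10), None);
}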

Doc Tests

Item 27 described the inclusion of short code samples in documentation comments, to illustrate the use of a particular public API item. Each such chunk of code is enclosed in an implicit fn main() { ... } and run as part of cargo test, effectively making it an additional test case for your code, known as a doc test. Individual tests can also be executed selectively by running cargo test --doc <item-name>.
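
For example, a doc comment like the following (a sketch, again assuming a crate called mycrate that exports nat_subtract) gets its embedded code compiled and run by cargo test:

/// Subtract `j` from `i`, returning `None` if the result would underflow.
///
/// ```
/// assert_eq!(mycrate::nat_subtract(4, 3), Some(1));
/// assert_eq!(mycrate::nat_subtract(4, 5), None);
/// ```
pub fn nat_subtract(i: u64, j: u64) -> Option<u64> {
    i.checked_sub(j)
}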

Assuming that you regularly run tests as part of your continuous integration environment (Item 32), this ensures that your code samples don't drift too far from the current reality of your API.

Examples

Item 27 also described the ability to provide example programs that exercise your public API. Each Rust file under examples/ (or each subdirectory under examples/ that includes a main.rs) can be run as a standalone binary with cargo run --example <name> or cargo test --example <name>.

These programs only have access to the public API of your crate, and are intended to illustrate the use of your API as a whole. Examples are not specifically designated as test code (no #[test], no #[cfg(test)]), and they're a poor place to put code that exercises obscure nooks and crannies of your crate – particularly as examples are not run by cargo test by default.

Nevertheless, it's a good idea to ensure that your continuous integration system (Item 32) builds and runs all the associated examples for a crate (with cargo test --examples), because it can act as a good early warning system for regressions that are likely to affect lots of users. As noted above, if your examples demonstrate mainline use of your API, then a failure in the examples implies that something significant is wrong.

  • If it's a genuine bug, then it's likely to affect lots of users – the very nature of example code means that users are likely to have copied, pasted and adapted the example.
  • If it's an intended change to the API, then the examples need to be updated to match. A change to the API also implies a backwards incompatibility, so if the crate is published then the semantic version number needs a corresponding update to indicate this (Item 21).

The likelihood of users copying and pasting example code means that it should have a different style than test code. In line with Item 18, you should set a good example for your users by avoiding unwrap() calls for Results. Instead, make each example's main() function return something like Result<(), Box<dyn Error>>, and then use the question mark operator throughout (Item 3).
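
A sketch of such an example program (a hypothetical examples/subtract.rs, again assuming a crate called mycrate that exports nat_subtract):

// examples/subtract.rs (hypothetical filename)
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Propagate errors with `?` rather than unwrap()ing, so that users who
    // copy this code inherit a sensible error-handling pattern.
    let diff = mycrate::nat_subtract(4, 3).ok_or("unexpected underflow")?;
    println!("4 - 3 = {}", diff);
    Ok(())
}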

Benchmarks

Item 20 attempts to persuade you that fully optimizing the performance of your code isn't always necessary. Nevertheless, there are definitely still times when performance is critical, and if that's the case then it's a good idea to measure and track that performance. Having benchmarks that are run regularly (e.g. as part of continuous integration, Item 32) allows you to detect when changes to the code or the toolchains adversely affect that performance.

The cargo bench command¹ runs special test cases that repeatedly perform an operation, and emits average timing information for the operation.

However, there's a danger that compiler optimizations may give misleading results, particularly if you restrict the operation that's being performed to a small subset of the real code. Consider a simple arithmetic function:

pub fn factorial(n: u128) -> u128 {
    match n {
        0 => 1,
        n => n * factorial(n - 1),
    }
}

A naïve benchmark for this code:

    // The standard benchmark harness needs the nightly-only `test` crate:
    // `#![feature(test)]` and `extern crate test;` at the crate root.
    use test::Bencher;

    #[bench]
    fn bench_factorial(b: &mut Bencher) {
        b.iter(|| {
            let result = factorial(15);
            assert_eq!(result, 1_307_674_368_000);
        });
    }

gives incredibly positive results:

test naive::bench_factorial       ... bench:           0 ns/iter (+/- 0)

With fixed inputs and a small amount of code under test, the compiler is able to optimize away the iteration and directly emit the result, leading to an unrealistically optimistic result.

The (experimental) std::hint::black_box function can help with this; it's an identity function whose implementation the compiler is "encouraged, but not required" (their italics) to pessimize.

Moving the code under test to use this hint:

#![feature(bench_black_box)] // nightly-only

pub fn factorial(n: u128) -> u128 {
    match n {
        0 => 1,
        n => n * std::hint::black_box(factorial(n - 1)),
    }
}

gives more realistic results:

test bench_factorial              ... bench:          42 ns/iter (+/- 6)

The Godbolt compiler explorer can also help by showing the actual machine code emitted by the compiler, which may make it obvious when the compiler has performed optimizations that would be unrealistic for code running a real scenario.

Finally, if you are including benchmarks for your Rust code, the Criterion crate may provide an alternative to the standard test::Bencher functionality which is:

  • more convenient (it runs with stable Rust)
  • more fully-featured (it has support for statistics and graphs); a sketch of a Criterion benchmark follows this list.
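
Rewritten for Criterion, the factorial benchmark might look like this (a sketch, assuming a benches/factorial.rs target with harness = false in Cargo.toml, and a crate called mycrate that exports factorial):

// benches/factorial.rs (hypothetical filename)
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use mycrate::factorial;

fn bench_factorial(c: &mut Criterion) {
    c.bench_function("factorial 15", |b| {
        // black_box hides the constant input from the optimizer.
        b.iter(|| factorial(black_box(15)))
    });
}

criterion_group!(benches, bench_factorial);
criterion_main!(benches);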

Fuzz Testing

Fuzz testing is the process of exposing code to randomized inputs in the hope of finding bugs, particularly crashes that result from those inputs. Although this can be a useful technique in general, it becomes much more important when your code is exposed to inputs that may be controlled by someone who is deliberately trying to attack the code – so you should run fuzz tests if your code is exposed to potential attackers.

Historically, the majority of defects in C/C++ code that have been exposed by fuzzers have been memory safety problems, typically found by combining fuzz testing with runtime instrumentation (e.g. AddressSanitizer or ThreadSanitizer) of memory access patterns.

Rust is immune to some (but not all) of these memory safety problems, particularly when there is no unsafe code involved (Item 16). However, Rust does not prevent bugs in general, and a code path that triggers a panic! (cf. Item 18) can still result in a denial-of-service (DoS) attack on the codebase as a whole.

The most effective forms of fuzz testing are coverage-guided: the test infrastructure monitors which parts of the code are executed, and favours random mutations of the inputs that explore new code paths. "American fuzzy lop" (AFL) was the original heavyweight champion of this technique, but in more recent years equivalent functionality has been included into the LLVM toolchain as libFuzzer.

The Rust compiler is built on LLVM, and so the cargo-fuzz sub-command exposes libFuzzer functionality for Rust (albeit only for a limited number of platforms).

To set up a fuzz test, first identify an entrypoint of your code that takes (or can be adapted to take) arbitrary bytes of data as input:

/// Determine if the input starts with "FUZZ".
fn is_fuzz(data: &[u8]) -> bool {
    if data.len() >= 3 /* oops */
        && data[0] == b'F'
        && data[1] == b'U'
        && data[2] == b'Z'
        && data[3] == b'Z'
    {
        true
    } else {
        false
    }
}

Next, write a small driver that connects this entrypoint to the fuzzing infrastructure:

#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    let _ = is_fuzz(data);
});

Running cargo +nightly fuzz run target1 continuously executes the fuzz target with random data, only stopping if a crash is found. In this case, a failure is found almost immediately:

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1139733386
INFO: Loaded 1 modules   (1596 inline 8-bit counters): 1596 [0x10cba9c60, 0x10cbaa29c), 
INFO: Loaded 1 PC tables (1596 PCs): 1596 [0x10cbaa2a0,0x10cbb0660), 
INFO:        7 files found in /Users/dmd/src/effective-rust/examples/testing/fuzz/corpus/target1
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: seed corpus: files: 7 min: 1b max: 8b total: 34b rss: 38Mb
#8	INITED cov: 22 ft: 22 corp: 6/26b exec/s: 0 rss: 38Mb
thread '<unnamed>' panicked at 'index out of bounds: the len is 3 but the index is 3', fuzz_targets/target1.rs:11:12
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/f77bfb7336f21bfe6a5fb5f7358d4406e2597289/library/core/src/panicking.rs:159:5
   3: _rust_fuzzer_test_input
   4: ___rust_try
   5: _LLVMFuzzerTestOneInput
   6: __ZN6fuzzer6Fuzzer15ExecuteCallbackEPKhm
   7: __ZN6fuzzer6Fuzzer6RunOneEPKhmbPNS_9InputInfoEbPb
   8: __ZN6fuzzer6Fuzzer16MutateAndTestOneEv
   9: __ZN6fuzzer6Fuzzer4LoopERNSt3__16vectorINS_9SizedFileENS_16fuzzer_allocatorIS3_EEEE
  10: __ZN6fuzzer12FuzzerDriverEPiPPPcPFiPKhmE
  11: _main

and the input that triggered the failure is emitted.

Normally, fuzz testing does not find failures so quickly, and so it does not make sense to run fuzz tests as part of your continuous integration. The open-ended nature of the testing, and the consequent compute costs, mean that you need to consider how and when to run fuzz tests – perhaps only for new releases or major changes, or perhaps for a limited period of time².

You can also make subsequent runs of the fuzzing infrastructure more efficient by storing and re-using a corpus of inputs that the fuzzer has previously found to explore new code paths; this helps later runs of the fuzzer explore new ground, rather than re-testing code paths that have already been visited.

Testing Advice

An Item about testing wouldn't be complete without repeating some common advice (which is mostly not Rust-specific):

  • As this Item has endlessly repeated, run all your tests in continuous integration on every change (with the exception of fuzz tests).
  • When you're fixing a bug, write a test that exhibits the bug before fixing the bug. That way you can be sure that the bug is fixed, and that it won't be accidentally re-introduced in future.
  • If your crate has features (Item 26), run tests over every possible combination of available features.
  • More generally, if your crate includes any config-specific code (e.g. #[cfg(target_os = "windows")]), run tests for every platform that has distinct code; a sketch of such platform-gated tests follows this list.
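
For example, tests can be gated the same way as the platform-specific code they exercise (a sketch, using a hypothetical default_config_path() helper):

#[cfg(test)]
mod platform_tests {
    use super::*;

    // Runs only where the Windows-specific implementation exists.
    #[cfg(target_os = "windows")]
    #[test]
    fn config_path_is_windows_style() {
        assert!(default_config_path().to_string_lossy().contains('\\'));
    }

    // Runs on the other platforms.
    #[cfg(not(target_os = "windows"))]
    #[test]
    fn config_path_is_unix_style() {
        assert!(default_config_path().to_string_lossy().starts_with('/'));
    }
}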

Summary

This Item has covered a lot of different types of test, so a summary is in order:

  • Write unit tests for comprehensive testing that includes testing of internal-only code; run with cargo test.
  • Write integration tests to exercise your public API; run with cargo test.
  • Write doc tests that exemplify how to use individual items in your public API; run with cargo test.
  • Write example programs that show how to use your public API as a whole; run with cargo test --examples or cargo run --example <name>.
  • Write benchmarks if your code has significant performance requirements; run with cargo bench.
  • Write fuzz tests if your code is exposed to untrusted inputs; run (continuously) with cargo fuzz.

That's a lot of different types of test, so it's up to you to decide how much of each is relevant and worthwhile for your project.

If you have a lot of test code and you are publishing your crate to crates.io, then you might need to consider which of the tests make sense to include in the published crate. By default, cargo will include unit tests, integration tests, benchmarks and examples (but not fuzz tests), which may be more than end users need. If that's the case, you can either exclude some of the files, or (for black-box tests) move the tests out of the crate and into a separate test crate.


1: Support for benchmarks is not stable, so the command may need to be cargo +nightly bench.

2: If your code is a widely-used open-source crate, the Google OSS-Fuzz program may be willing to run fuzzing on your behalf.