Item 30: Write more than unit tests

"All companies have test environments.

The lucky ones have production environments separate from the test environment." – @FearlessSon

Like most other modern languages, Rust includes features that make it easy to write tests that live alongside your code and that give you confidence that the code is working correctly.

This isn't the place to expound on the importance of tests; suffice it to say that if code isn't tested, it probably doesn't work the way you think it does. So this Item assumes that you're already signed up to write tests for your code.

Unit tests and integration tests, described in the next two sections, are the key forms of tests. However, the Rust toolchain, and extensions to the toolchain, allow for various other types of tests. This Item describes their distinct logistics and rationales.

Unit Tests

The most common form of test for Rust code is a unit test, which might look something like this:

// ... (code defining `nat_subtract*` functions for natural
//      number subtraction)

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_nat_subtract() {
        assert_eq!(nat_subtract(4, 3).unwrap(), 1);
        assert_eq!(nat_subtract(4, 5), None);
    }

    #[should_panic]
    #[test]
    fn test_something_that_panics() {
        nat_subtract_unchecked(4, 5);
    }
}

Some aspects of this example will appear in every unit test:

  • A collection of unit test functions.
  • Each test function is marked with the #[test] attribute.
  • The module holding the test functions is annotated with a #[cfg(test)] attribute, so the code gets built only in test configurations.

Other aspects of this example illustrate things that are optional and may be relevant only for particular tests:

  • The test code here is held in a separate module, conventionally called tests or test. This module may be inline (as here) or held in a separate tests.rs file. Using a separate file for the test module has the advantage that it's easier to spot whether code that uses a function is test code or "real" code.
  • The test module might have a wildcard use super::* to pull in everything from the parent module under test. This makes it more convenient to add tests (and is an exception to the general advice in Item 23 to avoid wildcard imports).
  • The normal visibility rules for modules mean that a unit test has the ability to use anything from the parent module, whether it is pub or not. This allows for "open-box" testing of the code, where the unit tests exercise internal features that aren't visible to normal users.
  • The test code makes use of expect() or unwrap() for its expected results. The advice in Item 18 isn't really relevant for test-only code, where panic! is used to signal a failing test. Similarly, the test code also checks expected results with assert_eq!, which will panic on failure.
  • The code under test includes a function that panics on some kinds of invalid input; to exercise that, there's a unit test function that's marked with the #[should_panic] attribute. This might be needed when testing an internal function that normally expects the rest of the code to respect its invariants and preconditions, or it might be a public function that has some reason to ignore the advice in Item 18. (Such a function should have a "Panics" section in its doc comment, as described in Item 27.)
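
In connection with the last point, #[should_panic] can also take an expected parameter, so that the test passes only if the panic message contains the given string. A minimal sketch reusing nat_subtract_unchecked from the example above (the "underflow" message is an assumption about what that function actually panics with):

#[cfg(test)]
mod panic_message_tests {
    use super::*;

    // Sketch only: the `expected` string must appear in the panic
    // message for the test to pass; adjust it to match the real message.
    #[should_panic(expected = "underflow")]
    #[test]
    fn test_panic_message() {
        nat_subtract_unchecked(4, 5);
    }
}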

Item 27 suggests not documenting things that are already expressed by the type system. Similarly, there's no need to test things that are guaranteed by the type system. If your enum types start holding values that aren't in the list of allowed variants, you've got bigger problems than a failing unit test!

However, if your code relies on specific functionality from your dependencies, it can be helpful to include basic tests of that functionality. The aim here is not to repeat testing that's already done by the dependency itself but instead to have an early warning system that indicates whether the behavior that you need from the dependency has changed—separately from whether the public API signature has changed, as indicated by the semantic version number (Item 21).
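
For example, a minimal sketch of such a test might look like the following (the hex dependency and the lowercase-output behavior are purely illustrative; substitute whatever dependency behavior your crate actually relies on):

#[cfg(test)]
mod dependency_tests {
    // Early-warning test: this crate (hypothetically) relies on `hex`
    // emitting lowercase output, so pin that behavior down here.
    #[test]
    fn test_hex_encode_is_lowercase() {
        assert_eq!(hex::encode([0xABu8, 0xCD]), "abcd");
    }
}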

Integration Tests

The other common form of test included with a Rust project is the integration test, held in the crate's top-level tests/ directory. Each file in that directory is run as a separate test program that executes all of the functions marked with #[test].

Integration tests do not have access to crate internals and so act as behavior tests that can exercise only the public API of the crate.
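
A minimal sketch of such a test, assuming that nat_subtract is part of the public API of a crate called somecrate (both names are illustrative):

// tests/subtract.rs file: built as a separate test binary, so the crate
// must be named explicitly and only `pub` items are visible.  No
// `#[cfg(test)]` is needed, because files under `tests/` are only ever
// built for testing.
use somecrate::nat_subtract;

#[test]
fn test_subtract_public_behavior() {
    assert_eq!(nat_subtract(4, 3), Some(1));
    assert_eq!(nat_subtract(4, 5), None);
}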

Doc Tests

Item 27 described the inclusion of short code samples in documentation comments, to illustrate the use of a particular public API item. Each such chunk of code is enclosed in an implicit fn main() { ... } and run as part of cargo test, effectively making it an additional test case for your code, known as a doc test. Individual tests can also be executed selectively by running cargo test --doc <item-name>.
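
For example, a sketch of a doc test on a possible version of the nat_subtract function from earlier (the signature and body here are illustrative, not a definitive implementation):

/// Subtract `j` from `i`, as natural numbers, returning `None` on underflow.
///
/// ```
/// # use somecrate::nat_subtract; // hidden setup line, not shown in rendered docs
/// assert_eq!(nat_subtract(4, 3), Some(1));
/// assert_eq!(nat_subtract(4, 5), None);
/// ```
pub fn nat_subtract(i: u64, j: u64) -> Option<u64> {
    i.checked_sub(j)
}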

Regularly running tests as part of your CI environment (Item 32) ensures that your code samples don't drift too far from the current reality of your API.

Examples

Item 27 also described the ability to provide example programs that exercise your public API. Each Rust file under examples/ (or each subdirectory under examples/ that includes a main.rs) can be run as a standalone binary with cargo run --example <name> or cargo test --example <name>.

These programs have access to only the public API of your crate and are intended to illustrate the use of your API as a whole. Examples are not specifically designated as test code (no #[test], no #[cfg(test)]), and they're a poor place to put code that exercises obscure nooks and crannies of your crate—particularly as examples are not run by cargo test by default.

Nevertheless, it's a good idea to ensure that your CI system (Item 32) builds and runs all the associated examples for a crate (with cargo test --examples), because it can act as a good early warning system for regressions that are likely to affect lots of users. As noted, if your examples demonstrate mainline use of your API, then a failure in the examples implies that something significant is wrong:

  • If it's a genuine bug, then it's likely to affect lots of users—the very nature of example code means that users are likely to have copied, pasted, and adapted the example.
  • If it's an intended change to the API, then the examples need to be updated to match. A change to the API also implies a backward incompatibility, so if the crate is published, then the semantic version number needs a corresponding update to indicate this (Item 21).

The likelihood of users copying and pasting example code means that it should have a different style than test code. In line with Item 18, you should set a good example for your users by avoiding unwrap() calls for Results. Instead, make each example's main() function return something like Result<(), Box<dyn Error>>, and then use the question mark operator throughout (Item 3).
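
A minimal sketch of an example program in this style (the filename, the somecrate name, and the error message are illustrative):

// examples/subtract.rs file
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Use `?` rather than `unwrap()`, so that code copied from this
    // example starts out with idiomatic error handling.
    let diff = somecrate::nat_subtract(4, 3).ok_or("underflow")?;
    println!("4 - 3 = {diff}");
    Ok(())
}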

Benchmarks

Item 20 attempts to persuade you that fully optimizing the performance of your code isn't always necessary. Nevertheless, there are definitely times when performance is critical, and if that's the case, then it's a good idea to measure and track that performance. Having benchmarks that are run regularly (e.g., as part of CI; Item 32) allows you to detect when changes to the code or the toolchains adversely affect that performance.

The cargo bench command runs special test cases that repeatedly perform an operation, and emits average timing information for the operation. At the time of writing, support for benchmarks is not stable, so the precise command may need to be cargo +nightly bench. (Rust's unstable features, including the test feature used here, are described in the Rust Unstable Book.)

However, there's a danger that compiler optimizations may give misleading results, particularly if you restrict the operation that's being performed to a small subset of the real code. Consider a simple arithmetic function:

pub fn factorial(n: u128) -> u128 {
    match n {
        0 => 1,
        n => n * factorial(n - 1),
    }
}

A naive benchmark for this code:

#![feature(test)]
extern crate test;

#[bench]
fn bench_factorial(b: &mut test::Bencher) {
    b.iter(|| {
        let result = factorial(15);
        assert_eq!(result, 1_307_674_368_000);
    });
}

gives incredibly positive results:

test bench_factorial             ... bench:           0 ns/iter (+/- 0)

With fixed inputs and a small amount of code under test, the compiler is able to optimize away the iteration and directly emit the result, leading to an unrealistically optimistic result.

The std::hint::black_box function can help with this; it's an identity function whose implementation the compiler is "encouraged, but not required" (their italics) to pessimize.

Moving the benchmark code to use this hint:

#[bench]
fn bench_factorial(b: &mut test::Bencher) {
    b.iter(|| {
        let result = factorial(std::hint::black_box(15));
        assert_eq!(result, 1_307_674_368_000);
    });
}

gives more realistic results:

test blackboxed::bench_factorial ... bench:          16 ns/iter (+/- 3)

The Godbolt compiler explorer can also help by showing the actual machine code emitted by the compiler, which may make it obvious when the compiler has performed optimizations that would be unrealistic for code running a real scenario.

Finally, if you are including benchmarks for your Rust code, the criterion crate may provide an alternative to the standard test::bench::Bencher functionality that is more convenient (it runs with stable Rust) and more fully featured (it has support for statistics and graphs).
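
For comparison, a criterion version of the earlier benchmark might look like the following sketch; it assumes a benches/factorial.rs target registered with harness = false in Cargo.toml, and criterion's exact API may differ between versions:

// benches/factorial.rs file
use criterion::{criterion_group, criterion_main, Criterion};
use somecrate::factorial;

fn bench_factorial(c: &mut Criterion) {
    c.bench_function("factorial 15", |b| {
        // `black_box` prevents the compiler from constant-folding the input.
        b.iter(|| factorial(std::hint::black_box(15)))
    });
}

criterion_group!(benches, bench_factorial);
criterion_main!(benches);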

Fuzz Testing

Fuzz testing is the process of exposing code to randomized inputs in the hope of finding bugs, particularly crashes that result from those inputs. Although this can be a useful technique in general, it becomes much more important when your code is exposed to inputs that may be controlled by someone who is deliberately trying to attack the code—so you should run fuzz tests if your code is exposed to potential attackers.

Historically, the majority of defects in C/C++ code that have been exposed by fuzzers have been memory safety problems, typically found by combining fuzz testing with runtime instrumentation (e.g., AddressSanitizer or ThreadSanitizer) of memory access patterns.

Rust is immune to some (but not all) of these memory safety problems, particularly when there is no unsafe code involved (Item 16). However, Rust does not prevent bugs in general, and a code path that triggers a panic! (see Item 18) can still result in a denial-of-service (DoS) attack on the codebase as a whole.

The most effective forms of fuzz testing are coverage-guided: the test infrastructure monitors which parts of the code are executed and favors random mutations of the inputs that explore new code paths. "American fuzzy lop" (AFL) was the original heavyweight champion of this technique, but in more recent years equivalent functionality has been included in the LLVM toolchain as libFuzzer.

The Rust compiler is built on LLVM, and so the cargo-fuzz subcommand exposes libFuzzer functionality for Rust (albeit for only a limited number of platforms).

The primary requirement for a fuzz test is to identify an entrypoint of your code that takes (or can be adapted to take) arbitrary bytes of data as input:

/// Determine if the input starts with "FUZZ".
pub fn is_fuzz(data: &[u8]) -> bool {
    if data.len() >= 3 /* oops */
    && data[0] == b'F'
    && data[1] == b'U'
    && data[2] == b'Z'
    && data[3] == b'Z'
    {
        true
    } else {
        false
    }
}

With a target entrypoint identified, the Rust Fuzz Book gives instructions on how to arrange the fuzzing subproject. At its core is a small driver that connects the target entrypoint to the fuzzing infrastructure:

// fuzz/fuzz_targets/target1.rs file
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    let _ = somecrate::is_fuzz(data);
});

Running cargo +nightly fuzz run target1 continuously executes the fuzz target with random data, stopping only if a crash is found. In this case, a failure is found almost immediately:

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1607525774
INFO: Loaded 1 modules: 1624 [0x108219fa0, 0x10821a5f8),
INFO: Loaded 1 PC tables (1624 PCs): 1624 [0x10821a5f8,0x108220b78),
INFO:        9 files found in fuzz/corpus/target1
INFO: seed corpus: files: 9 min: 1b max: 8b total: 46b rss: 38Mb
#10	INITED cov: 26 ft: 26 corp: 6/22b exec/s: 0 rss: 39Mb
thread panicked at 'index out of bounds: the len is 3 but the index is 3',
     testing/src/lib.rs:77:12
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f77bfb7336f2/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/f77bfb7336f2/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/f77bfb7336f2/library/core/src/panicking.rs:159:5
   3: somecrate::is_fuzz
   4: _rust_fuzzer_test_input
   5: ___rust_try
   6: _LLVMFuzzerTestOneInput
   7: __ZN6fuzzer6Fuzzer15ExecuteCallbackEPKhm
   8: __ZN6fuzzer6Fuzzer6RunOneEPKhmbPNS_9InputInfoEbPb
   9: __ZN6fuzzer6Fuzzer16MutateAndTestOneEv
  10: __ZN6fuzzer6Fuzzer4LoopERNSt3__16vectorINS_9SizedFileENS_
      16fuzzer_allocatorIS3_EEEE
  11: __ZN6fuzzer12FuzzerDriverEPiPPPcPFiPKhmE
  12: _main

and the input that triggered the failure is emitted.

Normally, fuzz testing does not find failures so quickly, and so it does not make sense to run fuzz tests as part of your CI. The open-ended nature of the testing, and the consequent compute costs, mean that you need to consider how and when to run fuzz tests—perhaps only for new releases or major changes, or perhaps for a limited period of time.1

You can also make the fuzzing infrastructure more efficient across runs by storing and reusing a corpus of inputs that earlier runs found to explore new code paths; this lets later runs of the fuzzer explore new ground rather than retesting code paths that have already been visited.

Testing Advice

An Item about testing wouldn't be complete without repeating some common advice (which is mostly not Rust-specific):

  • As this Item has endlessly repeated, run all your tests in CI on every change (with the exception of fuzz tests).
  • When you're fixing a bug, write a test that exhibits the bug before fixing the bug. That way you can be sure that the bug is fixed and that it won't be accidentally reintroduced in the future.
  • If your crate has features (Item 26), run tests over every possible combination of available features.
  • More generally, if your crate includes any config-specific code (e.g., #[cfg(target_os = "windows")]), run tests for every platform that has distinct code.

This Item has covered a lot of different types of tests, so it's up to you to decide how much each of them is relevant and worthwhile for your project.

If you have a lot of test code and you are publishing your crate to crates.io, then you might need to consider which of the tests make sense to include in the published crate. By default, cargo will include unit tests, integration tests, benchmarks, and examples (but not fuzz tests, because the cargo-fuzz tools store these as a separate crate in a subdirectory), which may be more than end users need. If that's the case, you can either exclude some of the files or (for behavior tests) move the tests out of the crate and into a separate test crate.

Things to Remember

  • Write unit tests for comprehensive testing that includes testing of internal-only code. Run them with cargo test.
  • Write integration tests to exercise your public API. Run them with cargo test.
  • Write doc tests that exemplify how to use individual items in your public API. Run them with cargo test.
  • Write example programs that show how to use your public API as a whole. Run them with cargo test --examples or cargo run --example <name>.
  • Write benchmarks if your code has significant performance requirements. Run them with cargo bench.
  • Write fuzz tests if your code is exposed to untrusted inputs. Run them (continuously) with cargo fuzz.

1. If your code is a widely used open source crate, the Google OSS-Fuzz program may be willing to run fuzzing on your behalf.