Rust’s rustc compiler has a surprising number of knobs to twist for release builds, and they can shave off significant time from your build process and even improve runtime performance if you know where to look.
Let’s peek under the hood of a typical release build using cargo build --release. By default, Cargo’s release profile sets opt-level = 3 and turns off debug info, debug assertions, and integer overflow checks, but it leaves cross-crate LTO off (lto = false) and splits each crate into 16 codegen units (codegen-units = 16). So there’s more to explore.
Consider a simple Rust project. Normally, you’d just run cargo build --release. This command triggers rustc with a default set of optimizations.
// src/main.rs
fn main() {
    // u64 avoids overflow: the total (499999500000) does not fit in the default i32.
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        sum += i;
    }
    println!("Sum: {}", sum);
}
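If you run cargo build --release on this, Cargo consults the [profile.release] section of Cargo.toml and falls back to its built-in defaults for anything you leave out. Those defaults are more conservative than many people assume (lto = false, codegen-units = 16, panic = "unwind"), so a profile tuned for maximum runtime performance has to override them explicitly: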
# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "unwind"
strip = true
This profile tells rustc to go all-in on optimizations. opt-level = 3 is the highest optimization level rustc offers. lto = "fat" (Link Time Optimization) allows LLVM to optimize across crate boundaries, seeing the whole program as one unit. codegen-units = 1 compiles each crate as a single unit instead of splitting it into chunks that are optimized in parallel, which is slower to compile but often results in faster code. panic = "unwind" keeps the default behavior of unwinding the stack when a panic occurs. strip = true removes symbols and debug info from the binary to reduce its size.
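Since several of these settings affect binary size, a quick way to see their effect is to have the program report the size of its own executable. The following is a minimal sketch, not part of the original project; own_binary_size is a hypothetical helper name, and it relies only on the standard library.

```rust
use std::{env, fs, io};

/// Returns the size in bytes of the currently running executable.
/// (Hypothetical helper, introduced here for illustration only.)
fn own_binary_size() -> io::Result<u64> {
    // Path of the binary that is executing right now.
    let path = env::current_exe()?;
    // File metadata gives the on-disk size without reading the file.
    Ok(fs::metadata(path)?.len())
}

fn main() -> io::Result<()> {
    let bytes = own_binary_size()?;
    println!(
        "binary size: {} bytes ({:.1} KiB)",
        bytes,
        bytes as f64 / 1024.0
    );
    Ok(())
}
```

Building this once with the default release profile and once with strip = true (or any other profile change) and comparing the printed numbers gives a rough before/after picture. The exact figures depend on your toolchain and target.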
The mental model for release builds is simple: trade compile time for runtime performance and binary size. You want the compiler to do as much heavy lifting as possible before the code runs. This means enabling aggressive optimizations, allowing LLVM to see as much code as possible at once (LTO), and potentially removing any metadata that isn’t strictly necessary for execution.
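To check whether a profile change actually pays off at runtime, it helps to time the workload rather than guess. The sketch below wraps the summation loop in a small timing harness. sum_to and time_sum are hypothetical helper names, and std::hint::black_box is used on the assumption that you want to discourage the optimizer from folding the whole computation into a constant at compile time. It is a rough measurement aid, not a substitute for a proper benchmarking tool.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums the integers in 0..n. u64 keeps the total from overflowing.
fn sum_to(n: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..n {
        sum += i;
    }
    sum
}

/// Runs `sum_to(n)` once and returns the result with the elapsed wall time.
/// `black_box` hides the input and the output from the optimizer, so the
/// call cannot simply be evaluated at compile time or removed as unused.
/// The optimizer may still simplify the loop body itself.
fn time_sum(n: u64) -> (u64, Duration) {
    let start = Instant::now();
    let result = black_box(sum_to(black_box(n)));
    (result, start.elapsed())
}

fn main() {
    let (sum, elapsed) = time_sum(1_000_000);
    println!("Sum: {sum} (took {elapsed:?})");
}
```

Running it under cargo run and then cargo run --release shows how much the optimization level alone changes the timing. A single run is noisy, so repeat it a few times before drawing conclusions.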
But what if your build times are still too long, or you suspect the binary could be smaller? You can tune these flags further. For instance, opt-level can be set to "s" or "z" to prioritize code size over speed ("z" is the more aggressive of the two and also turns off loop vectorization), which matters for embedded systems or whenever binary size affects distribution.
# Cargo.toml for size optimization
[profile.release]
opt-level = "s" # Optimize for size
lto = true # Link Time Optimization
codegen-units = 1
panic = "abort" # Abort on panic, smaller code
strip = true
Here, opt-level = "s" tells LLVM to optimize for code size, potentially sacrificing a bit of runtime speed. panic = "abort" replaces the default unwind behavior with an immediate process abort, which can result in a smaller binary by removing the unwinding code. The trade-off is that destructors do not run on panic and a panic can no longer be caught. lto = true is equivalent to lto = "fat", regardless of the opt-level.
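Because panic = "abort" changes observable behavior and not just size, it can be worth confirming which strategy a binary was actually built with. The following is a minimal sketch using the cfg!(panic = "...") predicate and std::panic::catch_unwind from the standard library; panic_strategy and ran_without_panic are hypothetical helper names.

```rust
use std::panic;

/// Reports the panic strategy this crate was compiled with.
fn panic_strategy() -> &'static str {
    if cfg!(panic = "abort") {
        "abort"
    } else {
        "unwind"
    }
}

/// Runs `f` and reports whether it finished without panicking.
/// Under panic = "abort" a panic inside `f` never returns here:
/// the process terminates instead.
fn ran_without_panic<F: FnOnce() + panic::UnwindSafe>(f: F) -> bool {
    panic::catch_unwind(f).is_ok()
}

fn main() {
    println!("panic strategy: {}", panic_strategy());
    println!("plain closure ok: {}", ran_without_panic(|| {}));
    // With "unwind" the panic is caught and `false` is printed after the
    // panic message. With "abort" the process exits at this call.
    println!("panicking closure ok: {}", ran_without_panic(|| panic!("boom")));
}
```

If any part of your program or its dependencies relies on catching panics, for example to isolate a failing task, this is the behavior that switching to panic = "abort" removes.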
A lesser-known but powerful aspect is how lto interacts with codegen-units. codegen-units = 1 with lto = "fat" is a common choice for maximum optimization, but you can sometimes strike a better balance by raising codegen-units slightly (e.g., to 2 or 4) while keeping lto = "fat". The early stages of code generation then run in parallel across more independent units, and LTO can still stitch them together at link time for cross-module optimization. Measure both build time and runtime before committing to either setting; for the absolute best runtime performance, codegen-units = 1 is usually king.
When you’re dealing with very large codebases or specific performance bottlenecks, you might experiment with flags that Cargo profiles don’t expose, including options for LLVM itself. These are passed via the RUSTFLAGS environment variable. For example, to tune for the build machine’s CPU and pass an option directly to LLVM:
RUSTFLAGS="-C target-cpu=native -C llvm-args=-enable-load-store-fusion=true" cargo build --release
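This example uses -C target-cpu=native to tune the code for the CPU of the build machine, which lets LLVM use every instruction set extension that CPU supports. The catch is portability: the resulting binary may crash with an illegal-instruction error on an older or different CPU. The llvm-args part passes an option straight through to the LLVM backend, in this case one intended to enable a load-store fusion optimization. LLVM’s internal options are not a stable interface and differ between LLVM versions and targets, so confirm that an option exists in the LLVM bundled with your toolchain before relying on it.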
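One way to see what target-cpu=native changed is to compare the features the binary was compiled to assume with the features the running CPU reports. The following is a minimal sketch that uses AVX2 on x86/x86_64 purely as an example feature; compiled_with_avx2 and cpu_has_avx2 are hypothetical helper names, and on other architectures the runtime check is stubbed out to return false.

```rust
/// True if the compiler was allowed to assume AVX2 when building this
/// binary (for example via -C target-cpu=native on a CPU that has it).
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// True if the CPU this binary is running on reports AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// Stub for non-x86 targets, where AVX2 does not exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_avx2() -> bool {
    false
}

fn main() {
    println!("compiled assuming AVX2: {}", compiled_with_avx2());
    println!("running CPU has AVX2:   {}", cpu_has_avx2());
}
```

A binary that was compiled assuming a feature the running CPU lacks is exactly the portability hazard of target-cpu=native. If you ship binaries to machines you don’t control, a more conservative target-cpu value is usually the safer choice.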
The next frontier after optimizing your release builds is understanding how to profile the resulting binary effectively to identify actual performance bottlenecks, rather than just blindly applying optimizations.