cargo-machete, find unused dependencies quickly

2022-04-27

cargo-machete is a new Cargo tool that detects unused dependencies in Rust projects, in a fast (yet imprecise) way. As of today you can install it with cargo install cargo-machete and then run it with cargo machete from any folder that contains a workspace or crate, to find if you have potentially unused dependencies. Beware, it can report a few false positives!

Problem statement

When developers hack on code, it’s pretty common to reuse software that already exists and has been written, optimized, and battle-tested by many others. In fact, that’s a core idea of the open-source movement, and one historical reason for its existence.

Zooming in on the Rust programming language, my opinion is that this is also a key reason why Rust has been so successful: plenty of crates doing everything you might need are already implemented for you and within reach on crates.io. Plus, there is the wonder of a one-does-it-all Cargo tool that makes it very easy to use those crates as dependencies in your project. ^[1]

However, this comes with a price: sometimes you add a dependency because it’s useful at a particular point in time. Much later, it’s not useful anymore, but you may have forgotten about it. And then, the dependency remains as a zombie in your Cargo.toml file. Cargo will include it in the compilation graph, despite the compilation artifacts not being used at all. The unused dependency will just stay there, silently weeping, waiting for you to recall it exists.

Of course, the problem can even become worse: maybe you maintain several crates that have unused dependencies. Or maybe you work with many crates as part of a workspace, and each may have unused dependencies. Or simply you use many dependencies yourself, and some may include unused dependencies. If you’ve published your crates and others use those, then everyone could also compile unused dependencies. At the scale of the entire Rust crates ecosystem, it can have a huge impact on the compile times, produced heat and wasted energy.

Have you heard about our lord and savior, `cargo-udeps`?

There’s already a nice tool for this in the ecosystem: cargo-udeps. It will compile your crate (or workspace) and then infer from the compiled artifacts what dependencies are used by your project, and thus show you which dependencies are unused.

That’s great, but the way it works forces a few tradeoffs:

- it requires compiling the whole crate with the nightly rustc compiler. For me, that means recompiling the whole project from scratch most of the time, since I mostly use stable rustc as my daily driver.
- if you compile for multiple targets (i.e. different combinations of CPU flavor, OS, environment, etc.), you’d need to run cargo-udeps on each of those to find per-target unused dependencies. For instance, if a dependency is only configured when compiling for x86_64 machines, then it may be flagged as unused on every other configuration.
- most of all, since it looks at compilation artifacts, it cannot know whether a specific dependency is used directly by your crate or only indirectly, leading to somewhat mystifying results in the case of workspaces.

Let’s dive a bit deeper into the last item, which I’ll refer to as the transitively-used dependencies problem. Say you have your project AAA that declares a dependency on serde in its Cargo.toml file, while serde is not directly used by your code. In fact, if you did a text-search of serde in AAA’s code with grep, you wouldn’t find a single match^[2]. But now AAA is using another crate, AB, that itself depends on serde. cargo-udeps will see that serde is used overall, so it cannot let you know that AAA’s Cargo.toml file declares an unused dependency on serde.

Graph of crates containing one unused crate

How is this a problem? After all, if the workspace uses serde even indirectly, then we will have to compile it at some point, so it’s not like it’s really unused.

First of all, the AAA crate might be using a different version of serde than the AB crate, and this could result in different copies of the same crate in your workspace. Note there are other nice tools that automatically detect this kind of situation (hi there cargo-deny).

Second, the order in which crates are compiled has an impact on compilation parallelism, and having unused dependencies may add spurious synchronization points in the compilation graph. When a Rust crate gets compiled by Cargo, Cargo proceeds in two phases:

- first, it collects information so as to unlock the compilation of other crates further down the road that may depend on this particular one. I don’t know precisely what it entails, but one can make educated guesses: parse the code, analyze which items are public, compute memory layouts for public types, collect type information and so on and so forth.
- then, it does the actual compilation: optimize and generate the actual machine code for that particular crate, that will be later linked with other artifacts to form the final executable program.

The advantage of this two-phase scheme is that once Cargo is done with phase 1 for a particular crate, it can kick off the same process for other crates up the dependency tree, while it runs phase 2 concurrently. With a multi-core machine, as is the norm on desktop computers, it’s almost certain that this will bring speedups!

For instance, consider the following Cargo.toml file from our previous example project:

[dependencies]
serde = "1.0"

Then a possible compilation graph could look like this:

Compilation graph showing phases

In this case, ab phase 1 can start as soon as serde phase 1 has finished, while serde’s compilation phase 2 happens in the background.

If you’re interested in reducing the overall compile times of your Rust project, I would strongly suggest reading Rust’s documentation on timings visualization. Crates which spend lots of time in the first phase (or more generally, in both phases) are basically pipelining bottlenecks, so identifying, removing, or working around them speeds up compile times overall.

Back to our small unused dependency problem: an unused dependency in your Cargo.toml may block the compilation of other crates up the dependency tree, and thus may slow down the whole compilation process by creating useless synchronization points.

Consider a crate C that depends on crates A and B, with B actually unused:

Pipeline stall

Here, the compilation of the crate C could start way earlier, but it is blocked waiting for the compilation of B to finish first, even though B is not even used!
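To put rough numbers on this stall, here is a minimal sketch of the two-phase model in Rust. Everything in it is an assumption made for illustration: the CrateInfo type, the build_time and example functions, the durations, and the idea that there are always enough cores to run whatever is ready. It is not how Cargo actually schedules work.

```rust
/// Hypothetical model of a crate in the two-phase scheme described above.
struct CrateInfo {
    /// Duration of phase 1 (collecting information), in arbitrary time units.
    phase1: u32,
    /// Duration of phase 2 (optimizing and generating machine code).
    phase2: u32,
    /// Indices of the crates listed as dependencies of this one.
    deps: Vec<usize>,
}

/// Total build time, assuming unlimited cores and assuming that phase 1 of a
/// crate can start as soon as phase 1 of every listed dependency is done.
/// Crates must be ordered so that dependencies come before their dependents.
fn build_time(crates: &[CrateInfo]) -> u32 {
    let mut phase1_end: Vec<u32> = Vec::with_capacity(crates.len());
    let mut total: u32 = 0;
    for c in crates {
        // Wait for the slowest listed dependency, used or not.
        let start = c.deps.iter().map(|&d| phase1_end[d]).max().unwrap_or(0);
        let end1 = start + c.phase1;
        phase1_end.push(end1);
        // Phase 2 of this crate runs right after its own phase 1.
        total = total.max(end1 + c.phase2);
    }
    total
}

/// Crates A, B and C with made-up durations; B is built in both cases
/// (say, because another workspace member needs it).
fn example(c_lists_b: bool) -> u32 {
    let a = CrateInfo { phase1: 2, phase2: 3, deps: vec![] };
    let b = CrateInfo { phase1: 10, phase2: 5, deps: vec![] };
    let c_deps = if c_lists_b { vec![0, 1] } else { vec![0] };
    let c = CrateInfo { phase1: 4, phase2: 6, deps: c_deps };
    build_time(&[a, b, c])
}
```

With these made-up durations, example(true) returns 20 time units and example(false) returns 15: B is still built in both cases, but C no longer waits for it.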

Solving this, the naive way

So when I was trying to confirm whether crates found by cargo-udeps were actually used or not in my Rust projects, I would grep (or better, use the blazingly fast Rust replacement ripgrep) for the crate’s name in the project. After all, the crate’s name is in the source directory if and only if the crate is used, right?

The answer is… mostly, yes. If we exclude dynamic code loading via mechanisms like dlopen or WebAssembly, then there aren’t so many ways to use other crates directly, in Rust code. In fact, we can exhaustively enumerate all the syntax items to use other dependencies in Rust:

use my_crate;
use your_crate as my_crate;
use { your_crate as my_crate };
extern crate my_crate;

fn main() {
    my_crate::something();
}

I’ve looked at a bit of Rust code now, and I haven’t seen other direct forms; if I am missing any, please let me know! Now, these are the most frequent ways to use a dependency, but there are in fact other ways:

- build.rs scripts can generate code that could use other crates, and that would not be visible through a text search in the src/ directory, as the generated code is somewhere inside the target/ directory (under the build script’s OUT_DIR).
- macros (procedural or not) can expand to code that’s using other crates, while the source code doesn’t explicitly mention them. For instance, the log_once crate uses the log macros in its own macros, but log_once doesn’t depend on log explicitly. It’s a bold and smart move: it breaks the coupling with the specific version of log, and as long as the high-level API of log is stable (which is the case), then log_once works with any version of log.
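The macro case can be reproduced in a single file. This is only a simulation under loud assumptions: the log module below is a stand-in I made up (not the real log crate API), and shout! is a hypothetical macro playing the role of one exported by a third-party crate.

```rust
// Made-up stand-in for a dependency; in a real project this would be a crate
// listed in Cargo.toml.
mod log {
    pub fn warn(message: &str) -> String {
        format!("WARN: {}", message)
    }
}

// Hypothetical macro, standing in for one exported by *another* crate.
macro_rules! shout {
    ($message:expr) => {
        // This path is resolved where the macro is called, so the caller needs
        // `log` to be available even though the caller never writes `log`.
        log::warn(&$message.to_uppercase())
    };
}

// The text of this function never mentions `log`, yet it relies on it.
fn caller() -> String {
    shout!("disk almost full")
}
```

A text search restricted to the body of caller finds no trace of log, which is exactly the situation a grep-based approach cannot see.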

And then, a raw text search still leaves a bit of room for spurious matches, where a crate looks used while it isn’t:

- raw text submatches: e.g. if a crate is named bar, then foobar:: would be a match if we’re doing a raw grep search
- text search isn’t syntactic analysis, and we wouldn’t know if a match is in a comment (// use foo;), or a string (String::from("use foo;")).
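Both pitfalls are easy to reproduce. The sketch below uses two hypothetical helpers of my own, naive_match and whole_word_match, built only on the standard library; neither is what grep or ripgrep does internally.

```rust
/// Raw text search: is the crate name anywhere in the line?
fn naive_match(line: &str, krate: &str) -> bool {
    line.contains(krate)
}

/// True for characters that can be part of a Rust identifier (simplified).
fn is_ident_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Only accepts occurrences that are not glued to other identifier
/// characters, which rules out `foobar::` when looking for `bar`.
fn whole_word_match(line: &str, krate: &str) -> bool {
    line.match_indices(krate).any(|(start, found)| {
        let before = line[..start].chars().next_back();
        let after = line[start + found.len()..].chars().next();
        !before.map_or(false, is_ident_char) && !after.map_or(false, is_ident_char)
    })
}
```

Here naive_match("foobar::baz();", "bar") is true while whole_word_match on the same input is false, which fixes the submatch problem. The comment problem remains: whole_word_match("// use bar;", "bar") is still true.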

But that would do most of the job, wouldn’t it? In particular, compared to cargo-udeps, this approach doesn’t suffer from the transitively-used dependencies problem. If you look for a crate’s name in the src/ directory and it’s not there, it’s likely not used by your crate. The End.

A tedious process calls for automation, so I made a tool

And I’ve called it cargo-machete. Like a machete, it is very useful for quickly weeding out things, but it is very imprecise and you wouldn’t trust it 100%.

The gist of it is:

- find directories that might contain Rust projects, as indicated by the presence of a Cargo.toml file
- for each dependency, create an absolutely ugly regular expression that matches any of the syntactic forms presented above. The regular expression does better than just raw text search, in particular it doesn’t run into the text submatch issue.
- then, for each source file in the project, try to match the regular expression against each line, and stop at the first successful match (which means the dependency is used)
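The steps above can be sketched with the standard library only. This is not the tool’s code: line_mentions is a crude stand-in for the regular expression (much looser, and it does suffer from the submatch issue), unused_dependencies is a hypothetical name, and source files are passed as in-memory strings instead of being read from disk.

```rust
/// Crude stand-in for the regular expression (an assumption of this sketch):
/// looks for `name::` anywhere, or a line starting with `use name` or
/// `extern crate name`.
fn line_mentions(line: &str, name: &str) -> bool {
    let line = line.trim_start();
    line.contains(&format!("{}::", name))
        || line.starts_with(&format!("use {}", name))
        || line.starts_with(&format!("extern crate {}", name))
}

/// Returns the dependencies that no line of any source file mentions.
/// `sources` holds the contents of each source file of one crate.
fn unused_dependencies<'a>(deps: &[&'a str], sources: &[&str]) -> Vec<&'a str> {
    deps.iter()
        .copied()
        .filter(|dep| {
            // A crate named `log-once` in Cargo.toml is `log_once` in code.
            let name = dep.replace('-', "_");
            // `any` short-circuits: scanning stops at the first matching line.
            !sources
                .iter()
                .flat_map(|source| source.lines())
                .any(|line| line_mentions(line, &name))
        })
        .collect()
}
```

Given one file containing use serde::Serialize; and another calling log_once::warn_once!, the dependencies serde and log-once count as used, while log and rand are reported.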

This tool is fast, because it combines the core library behind ripgrep for matching regular expressions with rayon for running the search in parallel across all the dependencies of a project. On my machine, the problem is CPU-bound, because of the execution of the regular expression (and maybe thanks to my NVMe storage too). That’s only one data point, but on the beefy desktop I use, it scans the entirety of the rust-lang/rust repository in 1.08 seconds, or all of BytecodeAlliance/wasmtime in 0.58 seconds.
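To give an idea of what “in parallel across all the dependencies” means, here is a sketch that swaps rayon for scoped threads from the standard library (Rust 1.63 or later). It spawns one thread per dependency, which a work-stealing pool like rayon’s would not do, and is_used is a deliberately simplistic, hypothetical check that only looks for name:: in the text.

```rust
use std::thread;

/// Simplistic, hypothetical check: is `name::` present in any source file?
fn is_used(dep: &str, sources: &[&str]) -> bool {
    let needle = format!("{}::", dep.replace('-', "_"));
    sources.iter().any(|source| source.contains(&needle))
}

/// Checks every dependency on its own thread and returns the unused ones,
/// in the order they were given.
fn unused_in_parallel<'a>(deps: &[&'a str], sources: &[&str]) -> Vec<&'a str> {
    thread::scope(|scope| {
        // Start all the searches first...
        let handles: Vec<_> = deps
            .iter()
            .map(|&dep| (dep, scope.spawn(move || is_used(dep, sources))))
            .collect();
        // ...then collect the results.
        handles
            .into_iter()
            .filter_map(|(dep, handle)| {
                let used = handle.join().expect("worker thread panicked");
                if used { None } else { Some(dep) }
            })
            .collect()
    })
}
```

Each search is independent of the others, which is why the problem parallelizes so well.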

The tool is open source, of course.

As is the tradition for Cargo tools, it can be installed with:

cargo install cargo-machete

and then can be used, from any directory that contains Rust code (be it a workspace, a single project, or a directory on top of many Rust projects), with the following line:

cargo machete

Here’s an output example:

> cargo machete
Looking for crates in this directory and analyzing their dependencies...
/home/ben/code/cargo-machete/integration-tests/with-bench/Cargo.toml -- no package, must be a workspace
just-unused -- /home/ben/code/cargo-machete/integration-tests/just-unused/Cargo.toml:
	log
unused-transitive -- /home/ben/code/cargo-machete/integration-tests/unused-transitive/Cargo.toml:
	lib1
Done!

There are false positives: code generated via macros or build scripts isn’t inspected, as it’s not in the src/ directory and cargo-machete doesn’t run any compile step. For instance, if a project depends on log, but uses it only through log_once, then cargo-machete will incorrectly flag log as an unused dependency.

The good news is that, thanks to a contribution from @daniel5151, you can specify known false positives in the Cargo.toml file of your crate, allowing use of cargo-machete in CI setups:

[package.metadata.cargo-machete]
ignored = ["log"] # false positive, used by log_once! macro

As far as I know, the risk for false negatives (i.e. crates that are unused, but the tool thinks they’re used) is pretty low. One such instance would be a multi-line string containing one of the use forms, but that seems rather unlikely to be present in most Rust projects.

The tool is still a bit rough, but it’s already been quite useful for some projects I’ve been working on! In a particular work project, most unused dependencies were transitively used and compiled, but the rejiggering of the compilation graph led to a 5% compile time speedup overall. Good impact over effort ratio.

What about other languages?

What makes this possible in Rust, and could it be extended to other languages?

Dynamic languages by nature dynamically load code, but there are still ways to try to automate detecting unused dependencies the same way cargo-machete does. Consider JavaScript and its require function, which can dynamically evaluate a string that’s a path to a file with code we want to import. Since there’s an infinity of ways to create a string, we can’t fully rely on finding require("abc") and assume that if it is not present, then abc isn’t used. Ditto with import statements, which can evaluate dynamic sources. That being said, if JS code is restricted to use require statements with only static strings or static import statements, then this may work too! Although when restricting to static requires, even just loading the code in Node.js would be sufficient to find unused dependencies with perfect accuracy.

Back to static languages, where I constrain the problem to non-dynamic dependencies (loaded via dlopen etc.). In a language like C or C++, there are no unified module systems or package description (yet! although cmake might be a de-facto standard). We can still apply this to header files, and look for their inclusion via #include statements. Macros and preprocessed code would also throw a wrench in the process. Then some human intervention would still be required to eliminate the .c files, but I haven’t thought about it too much.

Static analysis of compiled binaries might be simpler, for that matter. If we consider the problem for WebAssembly, we can frame it as “which imported functions are not used in the module”, potentially eliminating an entire range of host functions. In the simplest case, we could just look at the code section, through the function bodies, and see if there’s any reference to indices of every single imported function in call opcodes. Then, there can be function tables referencing those, so we have to make sure no table elements reference the function. And if any table is mutable and publicly exposed via an export, then a user of the wasm module may reference any function declared in the wasm module, including imported functions, so all bets are off. Note dead-code elimination in wasm would be pretty similar and suffer from the same limitations: after all, a function dependency is just another kind of function, in wasm! Each format may have idiosyncrasies like that. Static analysis of final binaries (as opposed to libraries) might be possible and reliable, though.
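Here is how that reasoning could look in Rust, on a heavily simplified and entirely hypothetical model of a module. The Module type and unused_imports function are made up for this sketch, and it ignores other ways a real wasm module can reference a function (function exports, ref.func, the start function, and so on).

```rust
use std::collections::BTreeSet;

/// Hypothetical, stripped-down view of a wasm module.
struct Module {
    /// Imported functions occupy function indices `0..imported_funcs`.
    imported_funcs: u32,
    /// For each function body, the function indices targeted by `call` opcodes.
    calls: Vec<Vec<u32>>,
    /// Function indices that element segments place into tables.
    table_elements: Vec<u32>,
    /// Whether any table is exposed to the outside through an export.
    table_exported: bool,
}

/// Returns the imported function indices that nothing references, or `None`
/// when an exported table means all bets are off.
fn unused_imports(module: &Module) -> Option<BTreeSet<u32>> {
    if module.table_exported {
        return None;
    }
    // Start by assuming every import is unused...
    let mut unused: BTreeSet<u32> = (0..module.imported_funcs).collect();
    // ...then cross out anything called directly or referenced from a table.
    for target in module.calls.iter().flatten().chain(&module.table_elements) {
        unused.remove(target);
    }
    Some(unused)
}
```

For a module with three imports where one body calls function 0 and a table references function 2, only import 1 comes out as unused; flip table_exported to true and the function gives up with None.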

Closing thoughts

For the sake of completeness, I should mention the existence of a rustc crate-wide lint for this, since May 2020 or so: #![warn(unused_crate_dependencies)]. This reports unused crate dependencies directly as a Rust warning, which in my opinion would be the ideal end goal! Unfortunately, some GitHub comments suggest it suffers from having too many false positives, and it still requires compiling the code.

@est31, of cargo-udeps fame, has been working on a better solution. It seems to not be so far from completion, so between this and the Rust lint, I’m hopeful that there could be a time where we have a solution that is perfectly precise, with neither false positives nor false negatives.

In the meantime, I hope that cargo-machete can be useful to some of you, or that it inspires others to make similar quick-and-dirty tools, in Rust or in other languages. Thanks for reading this far, and please get in touch if you have any thoughts about this!

^[1] If you don’t know about the cargo-edit tool that allows you to add a dependency to your project in one line with cargo add serde: now you do.
^[2] Wait for it…