I Was Wrong About Rust Build Times

An update to previous research about speeding build times, informed by unexpected increased cost of maintenance

Show Notes

Episode Sponsor: Tweede golf

Transcript

Amos: All right. So my topic today is: I was wrong about Rust build times. So a while ago, let's check it out. I have the article open here. In December of 2021, I published a 30 minute read- you know that's a lie. My reading time estimator is lying. But I have unit tests, with some of my articles, checking that the reading time is what it should be. So I can't change it now, because then I would have to update the test.

Anyway, it's about why is my Rust build so slow, and there's a bunch of headers here: what is cargo even doing? How much time are we spending on these steps? This is talking about cargo build --timings, uh, which I have here as "still relevant" in my slides, because I have slides today.

James: I read this blog post literally last week or two weeks ago when I was troubleshooting why certain builds were slow, and the answer was just a lot of generated serde code. So all that stuff I was talking about last time, of measuring: I read this article, going through and making sure that I had all of the line items that I wanted to check when I was digging into it.

Amos: How did you end up finding out that it was serde?

James: The way that I ended up figuring it out was I looked at the volume of generated code. So the step that I ended up doing was running cargo expand, which gives you the output of everything after expansion, and counting the line counts before and after. But also then running that expanded code through a tool called Form: basically it takes a single file and splits it up by module into multiple files. Which means if you expand, which gives you one output, and then run it through Form, you end up almost with the same folder structure, which means that when you then run it through like Tokei or another tool, you can do the before and after and be like, "Ah, this was 2,000 lines before expansion, and 54,000 lines after expansion."

So I don't have like a smoking bullet that it was serde, but serde and one other proc macro were generating like literally 7 or 8 times the size of an already 50,000 line Rust crate.

Amos: Right, right. So the figures you just gave like 2,000 to 54,000. Those are not exact, but it was in that ballpark, right?

James: For some of them, yeah. If they were just data structures in there, then it would blow up like that. If there was a lot of code, it wouldn't expand as hard, but there were a couple of files we had that were just data structures, that became very large generated code.
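The before-and-after comparison James describes comes down to counting lines twice and dividing. The following is a minimal sketch of that arithmetic only; it does not invoke cargo expand, Form, or Tokei, and the function names `count_code_lines` and `expansion_ratio` are made up for illustration. The 2,000 and 54,000 figures are the ballpark numbers from the conversation, not measurements.

```rust
/// Count non-blank lines in a piece of source text.
fn count_code_lines(source: &str) -> usize {
    source.lines().filter(|line| !line.trim().is_empty()).count()
}

/// How many times larger the expanded code is than the original.
/// Returns `None` when the original is empty, since the ratio is undefined.
fn expansion_ratio(before: usize, after: usize) -> Option<f64> {
    if before == 0 {
        None
    } else {
        Some(after as f64 / before as f64)
    }
}

fn main() {
    // A tiny hand-written sample: three code lines and one blank line.
    let sample = "#[derive(Debug)]\nstruct Point {\n\n}\n";
    println!("sample has {} code lines", count_code_lines(sample));

    // Ballpark figures quoted in the conversation, not measurements.
    let (before, after) = (2_000, 54_000);
    match expansion_ratio(before, after) {
        Some(ratio) => println!("{before} -> {after} lines: {ratio:.1}x"),
        None => println!("nothing to compare"),
    }
}
```

In practice the two inputs would be the line counts of the crate before expansion and of the expanded output, whichever tool produced them.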

Amos: Okay. So what the heck should we do about proc macros and can we get introspection going on again in the Rust project are two topics that I think will come up a bunch if we keep doing these.

James: Yeaahhh.

Amos: But okay, I'm going to, I'm going to go back on track on the article. I'm happy that it helped you. It did help a lot of other people as well, I know. Cause I got a, got a lot of good feedback. I'm generally happy about this article. It talks about picking a better linker. Since then, what has happened is that at the Worldwide Developers Conference, I think that's what their acronym stands for: WWDC, the Apple Developers Conference thingy in 2022. Uh, they made their linker twice as fast, which makes it less useful to switch to a different linker. I don't have it here, but the author of Mold tried different schemes. Mold was a third party alternative linker that was faster than all the other linkers, written in C++ I believe. And the author tried a bunch of schemes to monetize it, and eventually kind of gave up, I believe, and just said, "Okay, I guess I'm not gonna," cause I think he quit his job. I should, you know, I should source this better.

James: He was the developer that actually developed lld for LLVM, and then went on to build mold, and was selling it as sold, but what came around was essentially: Apple figured out what they were doing with sold, rewrote ld64 for their own thing, and then he went, "Well, that's kind of, you just ate all of my competitive gain. I'm not going to maintain a product anymore because there's no reason for me to recommend it for anyone to buy it, now that ld64 is just fast in the same way."

Amos: Yes. Which is interesting because we know Apple does that with apps, right?

James: It's called getting Sherlocked.

Amos: Like the app of the year, one year is going to be the next Apple app next year. So they do that, they let developers make things in their walled garden and then they make their own version and then you're irrelevant. But for me, I don't, I can't think of another high profile instance of this happening in open source. I guess...

James: Well, they do tiling windowing now. So I guess, that, that chewed a little bit, but yeah.

Amos: I'm always surprised, like ever since I went back to macOS full time. I used to have like a Windows machine, Windows 11 with Linux VMs. And now I just have Macs all around the house. It's just, there's no problem that several thousand dollars cannot fix. I'm always surprised to see the thriving ecosystem, like little Mac apps that just add a little thing that's missing in the ecosystem. And I'm always happy to throw them like between 10 and 30 bucks. Cause I'm like. It's a lifetime license. It's something I need. I should make an article about all these.

So Apple killed mold, and now the built-in thing is kind of okay. So RIP, but also good for everyone else. I don't know.

In terms of linker as well, that slide went missing for some reason, but recently rustc nightly started shipping their own... not their version of, just like... they started shipping lld and they started using it by default on x86-64 Linux. I don't know how much is in common between like- Apple's linker is lld's linker, no?

James: I don't think it is but I'm, I'm guessing here. It's not the same, but it might be a fork. I don't know.

Amos: Well, someone should check it out. If only there were two dudes talking and.. (laughs) I'll check it out before next week. So what else in the article? The article talks about incremental builds. I get sad every time I look at this because I, I've talked to some people who were involved with rustc around those features and my understanding is that there were plans to do incremental building, but better, or for more things like... it's partially done and it works well enough locally? But definitely for CI, for example, it's not even an option because it's just so hard to know what to commit to cache.

So if you're thinking of a traditional GitHub Actions CI cache workflow, you restore the cache at the beginning of the build, then you save the cache at the end of the build, and the first problem you end up having is that: suddenly the cache is 55 gigabytes and that's too many gigabytes. And so you don't know what to prune, what to keep. You don't know if you can use the main branch cache for the PRs, if you should bring back the PR cache to the… there’s security concerns, what if someone poisoned the cache, there's all sorts of things.

And that's kind of still doable with the non-incremental builds in Rust. But with incremental builds... I don't really think anyone's trying to do it. I think a lot of tools, like if you build Rust with Nix or if you build Rust with like Bazel or Buck or I don't know, I still need to try those and do a comparison. It's just… incremental building is nice locally for your machine.

Actually there is one person or one company doing incremental builds with caches in CI, but what they did is just essentially have persistent workers. I'm assuming they’re still cloud-based somewhere, but they have a bunch of workers and they just like clone them and they just bring the build to them instead of like restoring the cache somewhere else.

James: So like old-school Jenkins, not GitHub Actions where everything was a persistent machine and there's no isolation or VMing.

Amos: I'm assuming they're careful about it. I'm assuming it's still “cloudy” and like they can scale it up. Cause it's easy to like clone a VM and do copy-on-write. I don't know. There's some magic happening. Probably. It's the Earthly people.

I will add the links because I asked ChatGPT what article I was thinking of and it finally- it found it. When you let it do web searches it can find things. So it gave me all these articles and I was like: no, no, not this one. Not sccache, none of that. And it found it was the Earthly folks who did it for, I forget who. But we will have the link.

There's a lot of things about like rustc self-profiling that is still accurate as far as I know, some crates still hit pathological cases in rustc, that's still a thing, you can get some nice gains if you know where to look and there's cargo-llvm-lines that people are using with great success. But the one thing this article says that is actually, I think, in retrospect, a very bad idea is splitting it into more crates. And I will explain both the original reasoning, why I think it's bad and what I'm doing instead.

James: I was gonna say, because this was my only suggestion to my client when they said, "How do we make this go faster?" And I go, "Well, we probably shouldn't have one crate that's 50,000 lines before expansion and like 200 or 300,000 lines after expansion," because you lose parallelism at that point.

But I'm interested to hear why you're turned around because before I was real ready to suggest what you suggest in this blog post. Cause it made sense to me.

Amos: So I ended up having too many crates, is one of the things. And it ended up- it's not directly related to compile times, but like development in general. Having to maintain those 14 different crates, even though you have like cargo upgrade from cargo-edit, it's just a lot. Even if you inherit dependency versions from the workspace, even if you inherit the crate version from the workspace, with all the things you can do, I don't know, I had to think about a lot of things, uh, all the time. It was a barrier to just developing the thing. So that's one thing.

I think where my calculation went wrong — and I want to actually go and back it with hard data, which I don't have today, but I may have in the future for an actual write up — is that: sure, the crates can compile in parallel, but define “compile”, right? They can be parsed in parallel. That was already a thing. I don't know if cargo/rustc actually does it, but it could, right? I'm sure there's opportunities for parsing things in parallel. I haven't checked, but-

James: There's the parallel front ends feature. And then I can't remember all the stages that are real or nightly only today.

Amos: Yeah, if you run cargo build --timings, you can see your different crates actually building in parallel. But it ends up doing more work because since you're now using things across crates, you have to declare a bunch of things pub.

So whereas previously you could do like intra-crate, so inside of the same crate, it could do inlining, it could do a lot of things because it has all the code. It doesn't even need to like export those symbols. They're not in the object files and they're not in the archive files. The linker doesn't have to know about them. rustc can just do its little, its little cooking on the side and then generate only what LLVM needs to know about. Now there's a bunch of things that are pub and so it's all exported, it's all in the file and the linker now has to worry about all of that. Which would be fine because as we just said, linkers got faster.
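The visibility point can be illustrated with two modules inside one crate. This is a sketch only: the module and function names are invented, and it makes no claim about what rustc actually inlines or exports in any given build. It just shows which items need `pub` and which can stay crate-private.

```rust
mod parsing {
    // Visible only inside this crate. Every caller is known at compile time,
    // so the compiler does not have to treat this as part of a public API.
    pub(crate) fn is_blank(line: &str) -> bool {
        line.trim().is_empty()
    }
}

mod stats {
    use crate::parsing::is_blank;

    // Calls the crate-private helper directly. If `parsing` were moved into
    // its own crate, `is_blank` would have to be declared `pub` over there
    // for this call to keep compiling.
    pub fn count_non_blank(text: &str) -> usize {
        text.lines().filter(|line| !is_blank(line)).count()
    }
}

fn main() {
    println!("{}", stats::count_non_blank("a\n\nb\n"));
}
```

Splitting `parsing` out turns an internal helper into an exported item, which is the extra surface Amos is describing.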

But the other thing is now we have link-time optimization (LTO), which I really want to measure because I feel like it's taking so long, even if you have like the fastest linker on earth that does all the optimizations that mold did, and still does, you still have this thing where like compilers, well linkers will call a plugin to figure out what optimizations can they do across different compilation units and that takes a lot of time.

And if you're splitting all your crates in different crates, it has to do this every time because you have one binary that depends on all your workspace and every time you change any of these, it has to link again. If it was like one crate and you have incremental caching, then it can cache actually some of that. But I don't think the result of LTO is cached anywhere.

James: Well I think ThinLTO does. So I think this is the difference between ThinLTO and FatLTO. So FatLTO, my working model for it is: it takes all the compilation artifacts, vomits them as if they were one giant object file, and then runs optimization on them again without having them separate. Whereas ThinLTO keeps the LLVM, like, bitcode for each of the stages and can't do quite as powerful optimizations, but it doesn't have to, like you said, re-vomit everything into one pile so that it can optimize it again. It can do this all by parts, but it does lose some optimization opportunities there.

Amos: Sure. But again, I wish I had numbers to back it up, but in my experience, it still spends time there. It is yeah, for sure. FatLTO- which is non ThinLTO, whatever, regular LTO, classic LTO- definitely takes more time than ThinLTO. Also I think the default for dev builds is something called thin local LTO or something? Like crate local... yeah, that's, this is, this is like another level.

James: The LTO flag in Cargo.toml is one of the worst ones because it has a value- like you can put a string or a boolean there and if you put “false” it does ThinLTO and if you do “true” it does full LTO...

Amos: So false does a thin-local LTO, the thing I was just talking about. “true” or “fat” is the same thing.

James: It trips people up all the time and it's one of those I always have to go and look at it because I'm like — It makes sense historically why it's like that, but it makes me so upset because we don't usually deal with truthiness in Rust, like. And this is one of the few times we have to and it confuses everyone when I see it.

Amos: This is worse than YAML, because, yeah, false and off are different values here. At least in YAML it's consistent. Yeah.

James: And also Norway.
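To make the mix of booleans and strings concrete, here is a small sketch that maps the values Cargo documents for the `lto` profile setting onto an enum. The `LtoMode`, `TomlValue`, and `parse_lto` names are invented for this illustration and are not Cargo's internal types; returning `None` for any other string is this sketch's choice, not a statement about what Cargo accepts.

```rust
/// The modes Cargo documents for the `lto` profile setting.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum LtoMode {
    /// `lto = "off"`: LTO disabled.
    Off,
    /// `lto = false`: "thin local" LTO, across the local crate's codegen units only.
    ThinLocal,
    /// `lto = "thin"`.
    Thin,
    /// `lto = true` or `lto = "fat"`.
    Fat,
}

/// A value as it might appear in Cargo.toml: a TOML boolean or a TOML string.
enum TomlValue<'a> {
    Bool(bool),
    Str(&'a str),
}

/// Map a documented `lto` value to its mode; unknown strings give `None`.
fn parse_lto(value: TomlValue<'_>) -> Option<LtoMode> {
    match value {
        TomlValue::Bool(false) => Some(LtoMode::ThinLocal),
        TomlValue::Bool(true) => Some(LtoMode::Fat),
        TomlValue::Str("fat") => Some(LtoMode::Fat),
        TomlValue::Str("thin") => Some(LtoMode::Thin),
        TomlValue::Str("off") => Some(LtoMode::Off),
        TomlValue::Str(_) => None,
    }
}

fn main() {
    // `false` and "off" look like synonyms but select different modes.
    println!("{:?}", parse_lto(TomlValue::Bool(false)));
    println!("{:?}", parse_lto(TomlValue::Str("off")));
}
```

Written out this way, the trap is visible: `false` does not mean off, and `true` is just another spelling of "fat".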

Amos: I think in retrospect it was bad advice. It didn't end up helping much for me. I think the numbers are like wishy-washy. I didn't, I wasn't very scientific about it.

James: The other thing you haven't mentioned is the Orphan Rule. This is the other one that I see people complain about a lot when they have broken things up into many crates. Is, like you said, the other maintenance things of having to span multiple crates. The Orphan Rule is a big one where you want to be able to impl your trait for a lot of things and that becomes way harder when you leave, when you go from modules to crates. So, do you have something that you would recommend today, or is this where the research... this is like a proposal for more research, of "I need to figure out which-"
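The Orphan Rule friction James mentions can be shown with two modules standing in for what might later become two crates. The names here (`model`, `render`, `Summary`, `Bytes`) are hypothetical. Inside one crate the impl for `Vec<u8>` is accepted; the comments describe what changes after a split.

```rust
use model::Summary;

mod model {
    /// A trait owned by this crate.
    pub trait Summary {
        fn summary(&self) -> String;
    }
}

mod render {
    use crate::model::Summary;

    // Allowed today: `Summary` is defined in this crate, so it can be
    // implemented for a foreign type such as `Vec<u8>` from any module.
    impl Summary for Vec<u8> {
        fn summary(&self) -> String {
            format!("{} bytes", self.len())
        }
    }

    // If `model` and `render` became separate crates, the impl above would be
    // rejected inside `render`, because neither the trait nor the type would
    // be local to it. A common workaround is a local newtype wrapper:
    pub struct Bytes(pub Vec<u8>);

    impl Summary for Bytes {
        fn summary(&self) -> String {
            format!("{} bytes (wrapped)", self.0.len())
        }
    }
}

fn main() {
    println!("{}", vec![1u8, 2, 3].summary());
    println!("{}", render::Bytes(vec![0u8; 4]).summary());
}
```

The wrapper compiles in both layouts, but every call site now has to wrap and unwrap, which is part of the maintenance cost being discussed.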

Amos: So, okay, so the slides end here, but my— my thoughts don't.

James: Okay.

Amos: It's not just, "I was wrong and I think it was a bad idea." I do have a plan that I'm currently working on in my proprietary code base. I don't know how much proprietary code I have, but actually it's pretty easy to find out now because I just brought everything together. That sounds wrong. I don't have a million eight hundred thousand lines of code. Where does that come from?

I have about 18,000 lines of Rust because my website does a lot of things. It has this whole asset pipeline. It encodes images to like AVIF and WebP and whatnot. And I have a whole video pipeline as well, because I have a clone of the YouTube player on my website for patrons, so you don't have to sit through ads. I have the Markdown pipeline. I have KaTeX for equations. I have a lot of dependencies. I think it's about a thousand crates and it used to be separated into a bunch of different code bases. And one of the code bases, the blog, used to be separated into a bunch of crates because of that article.

So what I've spent the last couple days doing is actually put everything back into one single crate. And then the rest of the plan is to isolate things by how long it takes to build and make those into dynamic libraries, which in most other languages- well, no, in C or C++ would be okay, I guess, because that's what people have been doing forever. And it's, it's kind of how Linux distributions work. You just install dynamic libraries somewhere and then everything like shares the same version of things.

But in Rust specifically, it's complicated because we don't have a stable ABI for a good reason, but still. The compiler kind of doesn't want you to do that. It's not just the switch from like, "Oh, this crate should be dynamic. Just do that." I think it's not that easy. Or maybe it is, but like, then you have to make sure that the compiler version used is the exact same. So you run into these kinds of issues.

James: You already have like extern rust unsafe stuff which you can do.

Amos: Well, actually tell me, cause I just, yeah, I just skipped over it. Like knowing it's a headache, but maybe, you know, the details of the headache. So please tell me.

James: I was gonna say, if you manually, "I solemnly swear that I'm only using one version of the compiler with it," then you can extern and like consume extern definitions and have them be linked in. Like I said, I do a lot of embedded, so we haven't done a lot of dynamic linking because most microcontrollers don't have dynamic linkers.

But, I know that's how the compiler generally works with things like compile derive macros and things like that. Those are essentially compiler plugins and we can do the same thing, but it, it ends up being very easy to footgun. But if you promise that you're always doing a clean build of the entire system so that you can do a deployment, then that's not necessarily very unreasonable.

It's just not very portable to give to someone else.

Amos: I was, I was fact checking you in real time. Here's the funny thing that happened. So I looked at extern Rust because I forgot that was a thing, but then I remembered as you brought it up. So I looked at the Rust reference, but then I looked at the proc macro bridge, because you explained that's how it works, just does extern Rust. And the first result is of course an article by me about proc macro support in rust-analyzer. I forgot. They do a weird, interesting thing which might be worth discussing another time.

If you can guarantee that you're using the same version of everything everywhere, then that's fine, but I don't want to have that constraint. So one crate that lets you have a stable ABI, no matter what the rustc version is, is Stabby.

stabby-abi is the ABI crate itself. But stabby is the thing with like proc macros, and the README does a good job at explaining what it does, but yeah, essentially instead of making everything repr(C), which is what you would do if you wanted a stable ABI most of the time, or if you have to interface with C and C++, you use this, and then you can have this sort of plugin system where you have some traits and they are implemented by shared objects that you can load at runtime.
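For contrast, this is roughly what the plain repr(C) route mentioned above looks like: a table of `extern "C"` function pointers that a host and a plugin could agree on across compiler versions. It is a sketch under stated assumptions. It does not use stabby or its API, it does not load a shared object, and `PluginVTable`, `DEMO_PLUGIN`, and `run_plugin` are hypothetical names; the "plugin" lives in the same binary so the example runs on its own.

```rust
use std::os::raw::c_int;

/// A table of function pointers with a C-compatible layout.
#[repr(C)]
pub struct PluginVTable {
    /// Returns the interface version the plugin was built against.
    pub version: extern "C" fn() -> c_int,
    /// Applies the plugin's transformation to `input`.
    pub transform: extern "C" fn(input: c_int) -> c_int,
}

extern "C" fn demo_version() -> c_int {
    1
}

extern "C" fn demo_transform(input: c_int) -> c_int {
    input.wrapping_add(10)
}

/// In a real plugin this table would be exported from a shared object and
/// looked up at runtime. Here it is a plain static (an assumption made so
/// the sketch is self-contained).
pub static DEMO_PLUGIN: PluginVTable = PluginVTable {
    version: demo_version,
    transform: demo_transform,
};

/// Host-side call through the table.
fn run_plugin(plugin: &PluginVTable, input: c_int) -> Option<c_int> {
    // Refuse tables built against an interface version the host doesn't know.
    if (plugin.version)() != 1 {
        return None;
    }
    Some((plugin.transform)(input))
}

fn main() {
    println!("{:?}", run_plugin(&DEMO_PLUGIN, 32));
}
```

Everything crossing the boundary has to be expressed in C-compatible types, which is the restriction that makes trait-based alternatives attractive.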

And for me, for this code base specifically, this makes sense for me because I'm not actually going to load and unload them. Unloading them is the, the thing that you really should never do with Rust. I, I also have an article about loading and unloading things for like I forget what it's called... You're iterating on some things. You're just recompiling bits of it. They do that with Bevy for game development.

For this project, my blog, I just need to load all the plugins at runtime and I think I can get a lot of gains here because for example KaTeX pulls in an entire Javascript engine.

So if I only have to build that once, that's good. I don't know, all of SQLite: I can put SQLite into its own dynamic thing and it's fine. And so of course you lose out on, like, there's no LTO going on anymore because it's actually dynamic. It's too late to inline. There's no JITs there to save you. For like clear interfaces, I don't know, S3 storage pulls in a bunch of crates. It's all async. Async interfaces do work with Stabby. Just put them in a, in a shared object and see what happens.

So this is my new approach. I haven't finished it yet. I will keep you posted, but I just wanted to say that actually breaking things down into crates is not enough because you'll get caught at LTO time. It slows down development in general, and also I did it too much. I don't think it helped much because especially in a workspace, maybe that's the last thing I want to cover on this is that unless you use cargo-hakari, I don't know if you've used it before, which does the workspace hack.

James: Can you explain the workspace hack? Because I have a customer using it and I don't totally wrap my head around it because someone else at the customer set it up. But it broke some stuff and then the stuff got fixed. What is the workspace hack, and how does hakari help you with the workspace hack?

Amos: There's a chance the explanation I will give is completely wrong because I want to do it off the top of my head. I can tell you the problem it's supposed to solve. If in a workspace you have different members that depend on the same dependencies, say you have like three different crates in your workspace, they all depend on serde, but they have slightly different feature flags, then they don't actually share the artifacts, like, you're gonna have three different builds of serde for this workspace. And the workspace hack is just, okay, let's have a single crate that has all the dependencies that anything in your workspace depends on. And then-

James: Force unification, on all of it.

Amos: Yes, exactly. I have all the features you want enabled, like the union of all features that all the others are using. And then, there's a crate, a cargo subcommand called cargo-hakari (I don't know how to pronounce it) that sets it up for you and updates it for you. And it's really nice. And I've used it in the past, but yeah, if you don't do that, if you forget to do that, then you might think that, yeah, okay, I had this big dependency tree. I put it in a different crate and I don't touch it. So every time it's going to be cached. Actually, no, the reward is not as good as I thought it was.
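The "union of all features" idea can be sketched as a set operation. This is an illustration of the concept only, not how cargo-hakari is implemented; the `Member` type and `unify_features` function are invented, and the feature names are example inputs.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One workspace member's requests: (dependency name, requested features).
type Member<'a> = Vec<(&'a str, Vec<&'a str>)>;

/// For each dependency, collect the union of the features requested by any
/// workspace member. A workspace-hack crate would depend on each dependency
/// with exactly this unified set.
fn unify_features<'a>(members: &[Member<'a>]) -> BTreeMap<&'a str, BTreeSet<&'a str>> {
    let mut unified: BTreeMap<&'a str, BTreeSet<&'a str>> = BTreeMap::new();
    for member in members {
        for (dependency, features) in member {
            unified
                .entry(*dependency)
                .or_default()
                .extend(features.iter().copied());
        }
    }
    unified
}

fn main() {
    // Three hypothetical members asking for serde with different features.
    let members = vec![
        vec![("serde", vec!["derive"])],
        vec![("serde", vec!["derive", "rc"])],
        vec![("serde", vec!["alloc"])],
    ];
    for (dependency, features) in unify_features(&members) {
        println!("{dependency}: {features:?}");
    }
}
```

With one unified set per dependency, every member asks for the same configuration, so there is a single build of that dependency to share.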

James: I'm interested, in this because I want to see someone do dynamic loading because I haven't seen many people mess with that, so I'm excited. Because I know when you write it up, it will be written up.

The other thing though is: I want to poke you to have a more realistic baseline. That if you are saying, "Hey, I'm not going to get LTO because I am isolating all these components," make that your baseline. Like, change your cargo profile so that you have LTO completely turned off, you have maximum code gen units, you have maybe even nightly like parallel front end and things like that, of: okay, well if you really are saying that you're willing to give these things up because you're going to have dynamic plugins, what happens if you go all the way back? Because I've seen some A's and B's before where we go, "Ah, this is the worst that this can get with full LTO and everything like that." And then you compare it with the leanest possible setup ever, where you've made all of these compromises. And I'm not saying you specifically, but this is that- I think we've talked about this before- the perils of benchmarking, of like, I've implemented 10 percent of the problem and "Hey, look, it's 10 times faster" kind of thing.

But I'd be interested to hear, especially if you have realistic perf numbers of how much slower is it without LTO? Or how much slower is it with dynamic linking if you have any sort of like perf numbers to play with because not just how fast it compiles... what are you losing for that trade off?

Amos: Yeah, for sure. Yeah, if I do a write up about this or a video, I haven't decided yet. It will be either of these because I think it's a very interesting topic. I will definitely, yeah, start from like, if you don't have any dynamic loading, you do like full static linking and like FatLTO, all the things, like you said, the codegen units to one, I think, is that it? Yeah, yeah.

James: Well one would make it more optimized but slower.

Amos: Yeah, yeah. So we go from like the most optimizations with the slowest build to like the other end of the spectrum without changing the linking style and then to dynamic linking. One thing I'm specifically excited about with dynamic linking is that, even if like, cause the C dependencies I've talked about get recompiled, like SQLite, like the JavaScript engine: those get rebuilt, I don't know, when I upgrade to a new major Rust version or like I cargo clean accidentally or something, but most of the time when you just run cargo check, what happens when you save the file in a text editor, they don't actually get rebuilt. But, cargo check is still slow. It's still freaking slow because there's like so many, so many crates in the dep tree.

One thing I didn't go all the way on, talking about incremental compilation, is that: I really wish we just had a compiler server instead of a compiler CLI, because you can design the cache file format best you can. It still has all this work to do. Like it wakes up. It's like, "Okay, do we have a Rust toolchain file around here? Okay. We found the right rustc. Let's boot it up, okay." And it has like, "Oh boy, we have, we have a lot of cache files on this, better read all those." And even if it's like, obviously it's going to be in the memory cache of the operating system, it's not actually going to read from disk. That's still a lot of work to then realize, "Oh, darn it. Everything's up to date. We don't have anything to do here." And it just shuts down. Sure, it's not Node.js. It's not Ruby. It does that in like, a couple hundred milliseconds maybe, but it's still work that a server could just say, "Yeah, I know it's up to date, nothing changed," instantly.

James: If you haven't read Aleksey or matklad's writings on different approaches for different... He makes sort of an A and B explanation of at least rust-analyzer versus rustc, where rustc is a batch compiler, where rust-analyzer is sort of a live incremental compiler. There's one post that he has that's a really good proposal that's like, "Hey, if you're designing a new language now, start with an incremental server first, rather than a batch compiler, because most users end up wanting those features." Basically, he makes a proposal that says, "Everyone starts with a batch compiler because it's easy and then generally works towards an incremental compiler, and no one started with an incremental compiler first." Unless... well, I've talked to some folks about like Smalltalk and more dynamic environments, where it's meant to be sort of a live system that you are playing with and introspecting and poking, where you have a more interactive view of your compiler. So if you haven't read those articles you should, and if anyone's interested in what exactly you are talking about right now, they should go look at matklad's blog, which is also tremendous.

Amos: That is true. However, I should mention that I have read, I think everything matklad has written on the topic. And I think everyone listening to this should read about them. I also know that Alex has experimented with other languages. I don't know if he's even active in Rust nowadays anymore.

There was a point where I was thinking, "Oh, cool. rust-analyzer is the future, is going to take over rustc because it's incremental. It's already re-implemented a lot of the same stuff." So like I was thinking of a future with like rust-analyzer, polonius for borrow checking, the new trait engine chalk, I think. But then all those illusions have kind of been removed from me because polonius turns out may not be the actual, the right way to go. Maybe there's another way to do all of this.

Chalk, I think, it's being abandoned in favor of rewriting it another time. And rust-analyzer will never replace rustc for a lot of reasons, and does not even support all the things rustc supports. And I think we've already added things to the language that makes it impractical to have that approach. So rust-analyzer does like the best it can, but it can never actually replace this.

But, I do believe that if you gave someone a bit of money, which is giving them a bit of time, they could like hack it and just have the server, instead of reading everything from disk all the time. Just that, just like have the interface be parsing cargo args, whatever, but then keeping things in memory instead of reading them from disk all the time.

Cause I know the Rust compiler does cache things in memory, right? It has this query system backed by something like Salsa, which rust-analyzer also has. And if you query the same thing a bunch of times, which happens a lot during compilation, it doesn't actually recompute all the things, it does memoization heavily. So, just keep that in memory, right? How hard can it be?

James: I think the people that complain about how much memory rust-analyzer uses can talk exactly about how hard that is. But I'm interested to see more people hacking on it, like, a hundred percent.

Amos: Me too.
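The memoization idea from the end of the conversation, where asking the same question twice should not redo the work as long as the answers stay in memory, can be sketched in a few lines. This is a toy and not rustc's query system or Salsa: `QueryCache` and `line_count` are invented names, and the "query" is deliberately trivial.

```rust
use std::collections::HashMap;

/// A tiny memoizing cache: each distinct input is computed at most once for
/// as long as the cache stays alive in memory.
struct QueryCache {
    results: HashMap<String, usize>,
    /// How many times the underlying computation actually ran.
    computations: usize,
}

impl QueryCache {
    fn new() -> Self {
        QueryCache {
            results: HashMap::new(),
            computations: 0,
        }
    }

    /// Stand-in for an expensive query; here it only counts lines.
    fn line_count(&mut self, source: &str) -> usize {
        if let Some(&cached) = self.results.get(source) {
            return cached; // Answered from memory, no recomputation.
        }
        self.computations += 1;
        let value = source.lines().count();
        self.results.insert(source.to_string(), value);
        value
    }
}

fn main() {
    let mut cache = QueryCache::new();
    cache.line_count("fn main() {}\n");
    cache.line_count("fn main() {}\n"); // Second call hits the cache.
    println!("computations: {}", cache.computations);
}
```

A long-lived server process keeps a structure like this warm between requests, whereas a CLI has to rebuild or reload it on every invocation. The memory cost James points out is the other side of that trade.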

Episode Sponsor

Thank you to Tweede golf for sponsoring this episode!

This episode is sponsored by Tweede golf: a Rust consultancy from the Netherlands that you may know from their open source work like ntpd-rs and statime, or from their work organizing RustNL, the Dutch yearly Rust conference.

James has worked with them for a couple of years, and would personally recommend reaching out to them if you need help building software in Rust, embedded or otherwise, or to book a training to get your teams up to speed on topics like using async on bare metal systems.

Don’t forget to let them know we sent you!