Thread Locals Galore

An overview of Thread Local Variables, and the challenges they pose for experimental dynamic linking support

View the presentation

Audio

Download as M4A

Show Notes

Episode Sponsor: Ladybird web browser

Transcript

Amos Wenger: We're getting better at this, but I hate how awkward I am. I'm impressed. So I spent the whole morning working on a video script, and that was very smooth and very well spoken. Then I sign on to this call with two Americans, and I'm like, "Oh, is my accent okay? Can they understand me? They can tell I'm French, and I'm like, I know it!"

James Munns: You were very well spoken.

Amanda Majorowicz: Oh yeah.

Amos Wenger: Yes, thank you.

James Munns: So do you have like your punchy sentence of like, "I'm gonna focus on this"?

Amos Wenger: No, I'm not going to focus on anything ever. That's your thing. No, I'm jealous of your slides because I have actual bullet points still. You have like, catchy sentences that drive you on each slide. But at least I'm looking at the presenter view right now so I know what's coming up next. But mostly I make slides so that I can ignore them later.

James Munns: Yeah. I make slides that are like subtitles basically, and they're just punchy one-liners. So like yours, if I had bullet points, I'd just make like nine slides for that.

Amos Wenger: Yeah, yeah, exactly.

Amanda Majorowicz: Well, and your, your thread, it says "thread" in the title. So that's the thread.

Amos Wenger: It's, well, it's thread-locals. It's a single unit.

Amanda Majorowicz: Do I know what I'm talking about? No. The end.

Amos Wenger: No, neither do I. Here's the other thing, is that I'm always talking about something that I just found out about. So, most likely this podcast is going to be a whole series of, "Never mind what I said last episode, I didn't know what I was talking about, I have been corrected."

James Munns: Engagement bait! Engagement bait!

Amos Wenger: But not on purpose! I'm just learning things. I got something wrong in a video and it's been haunting me. I haven't published anything in the months since, because I'm like, I got it really wrong. And what do you do? You're not going to delete the video. YouTube doesn't let you just add an errata. I have a pinned comment, but nobody reads those. So I just have to live with it.

James Munns: Well, we can add a section to the front of these eventually of just "Corrections From Previous Episodes."

Amos Wenger: Yeah. Yeah, we can do that.

Amanda Majorowicz: Yeah, just like in Time magazine, they're like, "Yes, corrections..."

Did somebody point it out to you though, the mistake or whatever?

Amos Wenger: Yes, a friend.

Amanda Majorowicz: Oh!

Amos Wenger: A friend who I hadn't spoken to in weeks, uh, messages me on Signal saying, "Uh oh." I was like, "What do you mean, 'Uh oh?'" He was like, "Well, this is not how TCP/IP works." I was like, "Oh, well..."

James Munns: It's funny, I posted my DMA slides in the chat and someone "Well actually'd" me on that. They're like, "Is the peripheral-" So like, this is the whole thing that I didn't get into- You get into these things called interconnect matrices, where there's actually multiple buses and multiple accessors. So you have the accessors, like the CPU and the DMA like this, and then there's the different buses that they can access.

And there's actually- a big part of this arbitration is just, which matrices are you going through? And in a lot of current architectures, peripheral bus is not something you directly connect to. It's actually there's an adapter- so that memory bus that I talked about was something called the AHB or the high speed bus, and then there's the APB or peripheral- I dunno.

Yes. I had people "well actually" me. But then it was funny 'cause the room circled around to even more "well, actually-ing" where we started, "well actually," and we realized that like, even on common microcontrollers, they're all implemented totally radically differently.

Amos Wenger: For the benefit of the listener, if this is kept, I hope it is: uh, alt text for looking at James' hands. James' hands briefly became an interconnected matrix of various buses.

James Munns: Four fingers, one vertical, one horizontal. Yeah. It's a matrix.

Amos Wenger: Yes, of course. But it was not any clearer to me, looking at the hands versus not having the hands, if that's any comfort.

James Munns: I'll send you some diagrams, 'cause we went and looked up reference manuals, data sheets and stuff like that, 'cause some of them do explain how those are wired up. But yeah, this is, I guess, the pre-talk for the... we have post-talk, I guess? I don't know.

Amos Wenger: That's fine. That encourages people to actually listen to every episode, 'cause they're like, "Oh, I missed something in the previous one." So that's, that's smart as well.

I should do the same and I kind of have the community to do that now, even though I've pulled people from all sorts of different places.

So actually, no: like, maybe one of them, if they're awake at that time, but not everyone like you. And also I like the element of surprise, so even just the slide deadline that we've set, I'm like, "Ah, he's gonna read my slides and then the surprise is gonna be gone." I want to go to the next slide and have people go, "Whoa!" See, I want that. That's, that's what I aim for in my videos. Also, if I talk about a project I haven't finished, I already get the reward for doing it, even though it's not finished, and then I give up on it. Which is not the case here, because there's a lot of social pressure to show up.

James Munns: My hope for this is that it's a dopamine drip.

It's been super, super nice to just ping you, like, once a week and be like, "Alright, I got three things on my mind. Which one's interesting?" And you're like, "This one." And I'm like, "Dope!" That's all I needed. I just need someone to tell me that they wanted to hear me talk about it and now all of a sudden I'm fired up to talk about it.

Amos Wenger: Yes. I think my reply was one word, so it's proof that it really doesn't take much.

James Munns: Yeah. Captive audience.

Amos Wenger: All right. Well, mutually assured slides or whatever. Oh, this should have been the podcast name! Mutually Assured Presentation or something.

James Munns: There you go.

Amos Wenger: Oh, wow. All right, today I want to talk about thread-locals, um, again.

James Munns: The descent of madness into thread-locals.

Amos Wenger: So the title of my presentation is "Thread-locals Galore," and the subtitle is "the only thing more evil than one singleton is multiple copies of that same singleton."

Obviously. So let's talk about variables. Variables might not be variables; some of them are const. Uh, don't pay attention, that's just how we name things around here. There's such a thing as local variables, which are usually stored on the stack, unless you're not looking, and then the optimizer might put them in a register. Unless you want to debug, and then it's using debug information to know which register to look in and pretend it's actually on the stack. Whatever.

James Munns: Then there's heap allocation. The heap, as opposed to the stack, is a large area where we put things, so kind of the same, but we don't put them in order, as the name implies. It's kind of disorganized, that's why it's called 'heap' but actually it's very very well organized because the allocator knows where everything is. You can free things you can allocate things, even if you didn't know the size at compile time, you can decide the size at runtime like: Oh, this time it's going to be an array of 128 elements. And this time it's going to be 64. And I didn't know that at compile time, but it's okay. Cause the allocator is here to find me some free space. If there's not too much fragmentation. You could do that on the stack too, in C, but not Rust. Cause C has a cool thing called alloca(). Which is basically like- it's using your stack as a bump allocator. And the reason that Rust doesn't expose it is because it's wildly dangerous. And uh, maybe a little less dangerous in Rust where we have real lifetimes where you'd be able to tell. But a lot of people just avoid it because it's very easy to accidentally hand back a pointer to your stack, and: oops, stack corruption. But yeah, heap is usually where we put the dynamically-sized stuff now, at least.
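A minimal Rust sketch of what James describes: the allocation size is only known at runtime, and the allocator finds heap space for it (the size calculation here is arbitrary, just for illustration).

```rust
fn main() {
    // Size decided at runtime, not compile time.
    let n = std::env::args().count() * 64;

    // The allocator finds free heap space for a runtime-chosen size;
    // this couldn't live in a fixed-size stack slot picked at compile time.
    let buf = vec![0u8; n];
    println!("allocated {} bytes on the heap", buf.len());
} // `buf` is freed here, when it goes out of scope
```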

Amos Wenger: Yeah. See, that's not even in my bullet points, but yes, that is correct. I remember, uh, VLAs, variable-length arrays, from my C days. I used to make a language that compiled down to C, so that was, that was a thing. Next up, so we talked about locals, we talked about heap allocations.

Third category, statics: stored in memory-mapped sections of the executable. Surprise, executables are files, but they're also mapped in memory. And some part of them is code, which we call the text section, because of course we do. And then data and constants and whatever are stored in other sections, whose names I forgot because I didn't study.

I don't want to call them statics. I want to call them process-local because their lifetime is essentially that of the process. So the executable is mapped into memory and then started, and then you can be sure that that memory area is reserved from the beginning of the process to the end, to its death.
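A tiny sketch of that third category (the section name in the comment is the usual one, not something from the episode):

```rust
// Stored in a memory-mapped section of the executable (.rodata here);
// its storage is reserved from the start of the process to its death,
// hence "process-local".
static GREETING: &str = "hello from the executable image";

fn main() {
    println!("{GREETING}");
}
```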

And then, fourth category, we have thread-local storage, which is a lie, largely. It's essentially just like process-local storage, except relative to some address, and that address changes every time we switch to a different thread. And that is arranged by the kernel, using different registers depending on the architecture.

The last time I studied this, I was on 64-bit Linux. That's a few years ago. And, uh, we used the FS segment register back then. I have no idea what's happening on 32-bit Linux. It's probably different. I am now on 64-bit ARM macOS, and ChatGPT told me that the TPIDR_EL0 register is being used. James has input? No?

James Munns: Well, this is all desktop stuff. I hang out in microcontrollers- microcontrollers don't have thread-local storage because we don't have threads. So, makes sense to me.

Amos Wenger: Lucky you. But basically, yeah. Every thread-local has an offset. So some of it you can determine statically. Let's say you have three variables in your executable and they're all marked as thread-local. So the compiler annotates those accordingly, the linker puts everything together: it adds up all the thread-locals that everybody knows about from all the objects and makes room for them in the executable. But then you can also, of course, allocate thread-locals dynamically, by asking the operating system nicely. Which is the libc, which is the allocator, which is the threads runtime. It's all the same thing. You, you think there's such a thing as like different libraries on your Linux system, but it's all libc, right? It's libc and whatever UI framework you're using.
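As a sketch of the raw mechanism: nightly Rust exposes the compiler-level attribute directly (this needs the unstable `thread_local` feature; on x86-64 Linux the access typically compiles down to a load at an offset from FS, on AArch64 to an offset from TPIDR_EL0).

```rust
#![feature(thread_local)] // nightly-only, for illustration

use std::cell::Cell;

// The compiler annotates this as TLS; the linker adds up all such
// variables and reserves a slot at a fixed offset in each thread's block.
#[thread_local]
static COUNTER: Cell<u64> = Cell::new(0);

fn bump() -> u64 {
    // Roughly "read/write at thread-base + offset" in the emitted code.
    COUNTER.set(COUNTER.get() + 1);
    COUNTER.get()
}

fn main() {
    println!("{}", bump());
}
```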

Cool. So that's easy, right? Uh, pretty much. You have all those thread-locals in the same block, and there's one of these blocks per thread, and you just change the base address whenever you change the thread. You don't even have to worry about it. You don't actually do it. The kernel does it.
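On stable Rust, the same per-thread behavior goes through the std macro; a small sketch showing that each thread sees its own copy:

```rust
use std::cell::Cell;

thread_local! {
    // One slot per thread, in that thread's TLS block.
    static ID: Cell<u32> = Cell::new(0);
}

fn main() {
    ID.with(|id| id.set(1));

    std::thread::spawn(|| {
        // A fresh thread gets a fresh copy: the base address changed,
        // the offset didn't.
        ID.with(|id| assert_eq!(id.get(), 0));
    })
    .join()
    .unwrap();

    // The main thread's copy is untouched.
    ID.with(|id| assert_eq!(id.get(), 1));
}
```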

And that's it, right? Wrong. Because we have to worry about the case where the thread ends. Those things happen. Uh, you start a thread, and then it ends, and in Rust, we have a thing that has been widely regarded as a bad idea and made everyone angry. It's the Drop trait. No, it's a very good idea. James is frowning.

We have the Drop trait, which means it's a destructor. Whenever you drop a value, nobody owns it anymore. Nobody has a reference to it anymore. It expires. It falls out of scope. Then you can run some code to clean some things up, which is very useful if that value is your view of some hardware resource. So like, I don't know, a network connection, an open file.

James Munns: So yeah, I guess the difference between these and statics- so statics are interesting because they live forever forever, so as far as you're concerned, like, the destructor will never be run on a static, but thread-locals have to both live forever, at least apparently to each thread they have to live forever, except for when the thread dies, but like, so it has to live forever except for until it doesn't, I guess.

Amos Wenger: Yes, exactly. In fact, inside the internals for thread-locals in Rust, there's a thing that says: Well, we say 'static, but it's not actually true. It's actually slightly shorter than the lifetime of even the thread, which is already less than the lifetime of the process, because we need to run the destructor at some point, and then that'll be the end of that.

And so 'static is a double lie in this case. So that's the first complication. Some types implement Drop, so we can't just, like, free the underlying storage and not refer to it anymore. We have to know the type of every thread-local and run its destructor if it has one.
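A sketch of that first complication: a thread-local whose type implements Drop, so its destructor has to run just before the thread dies (the Connection type is made up for illustration).

```rust
struct Connection(&'static str);

impl Drop for Connection {
    fn drop(&mut self) {
        // Runs at thread exit: slightly shorter than even the thread's
        // lifetime, which is why 'static is a double lie here.
        println!("closing {}", self.0);
    }
}

thread_local! {
    static CONN: Connection = Connection("worker's connection");
}

fn main() {
    std::thread::spawn(|| {
        CONN.with(|c| println!("using {}", c.0));
        // thread ends here; CONN's destructor runs
    })
    .join()
    .unwrap();
}
```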

Complication number two, related to birth rather than death this time: sometimes you know exactly what the byte pattern of some value is going to be. So if you initialize a signed 64-bit integer to 42, you know exactly what that looks like. You can bake that into the executable, you can memory-map that. It's const, it's beautiful. You know exactly what it's going to be, but sometimes you don't. Sometimes you have to run some code at runtime. I'm going to say "run at runtime" a lot. That's okay. It's part of the process. You have to run some code at runtime to figure out what the byte pattern is going to be. It's not const. It cannot be evaluated at compile-time. It has to be evaluated at runtime.

In which case, what you do is, hopefully you know how much storage you need, so you reserve that. And then you have a thing that, every time someone tries to access that thread-local, goes, "Wait a minute, let's check the state. Has it already been initialized? If not, let's run the constructor" (I guess, even though we don't explicitly have constructors in Rust) "let's run the initialization code, which is going to put the initial byte pattern in place, doing whatever it needs to do. And then you get to actually borrow it." And if you put those together, you have a whole lifecycle of like: this thing is not initialized yet. This thing is initialized, you can have it.

We've already run the destructor, why are you even trying to borrow this? The way this is all implemented in the Rust standard library is actually pretty smart, and, uh, I was actually pretty impressed with it. I can't show you code because you're just listening to this, of course, but I'm looking at an array, a 2x2 matrix, of whether a type has a Drop implementation or no Drop implementation. James is making finger motions.

James Munns: Sorry, I was throwing the 2x2 matrix gang sign. Yeah, sorry. As opposed to the 4x4 AHB matrix axis gang sign. That'll only make sense if you listen to the last episode or whatever, however it's oriented, but it's an easter egg.

Amos Wenger: So I'm looking at it: there's two dimensions, right? There's 'needs Drop' or 'doesn't need Drop,' and then there's 'initialization is const' or 'initialization is lazy,' and that gives us four different scenarios. And in the best scenario, which is 'doesn't need Drop' and 'initialization is const,' we just use whatever the compiler does. And you're going to ask, "But isn't the compiler rustc?" No, it's LLVM. So you just tell LLVM it's a thread-local, and then it does the thing I said, which is to statically reserve some storage in the executable. And the linker also knows about that. And the dynamic loader also knows about that. The operating system... everybody knows what to do, except for humans, because there's like 13 people who have gone to the lengths of understanding how this works.
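The two dimensions, sketched with the std macro (const-block initializers in thread_local! are stable; the variable names are made up):

```rust
use std::cell::RefCell;

thread_local! {
    // 'doesn't need Drop' + 'initialization is const': the best cell
    // of the 2x2 matrix. The byte pattern is baked into the executable
    // and access is plain LLVM-level TLS.
    static ANSWER: i64 = const { 42 };

    // 'needs Drop' + 'initialization is lazy': the worst cell. First
    // access per thread runs the initializer behind a state check, and
    // the destructor must run at thread exit.
    static LOG: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

fn main() {
    ANSWER.with(|a| println!("{a}"));
    LOG.with(|l| l.borrow_mut().push("lazily initialized".into()));
}
```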

Ultimately, it all comes down to this one single interface, which is called LocalKey. And I have to zoom into my own slide here because, Jesus, that's a lot of comments. LocalKey is a struct that has a single field, and the field is not the value. The field is a function (not a closure, a bare function) that takes an Option of a mutable reference to an Option of the value. I know it's hard to follow; if you pay for the ten-bucks-a-month tier, you could actually see the function. So just go, just go to the standard library, type in LocalKey and, uh, brace yourself. And it returns a const pointer, not a reference, to T. And there's a comment saying, "Well, we say 'static, but it's not actually 'static," even though there's no mention of 'static in the function signature; it just returns a pointer. There is a 'static up there: T is supposed to be 'static, it's supposed to be owned.

James Munns: So it's a function that takes an optional pointer to a space that might hold the thing, so I guess two layers of indirection here? I love the comment-to-code ratio, because there's like three lines of actual functional code here, two configuration blocks on it, and then like 15 lines of: "Okay, prepare yourself."
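Paraphrasing the standard library source (comments abridged; look up std::thread::LocalKey for the real, heavily commented thing):

```rust
pub struct LocalKey<T: 'static> {
    // Not the value: a bare function returning the address of this
    // thread's copy of the value. The Option<&mut Option<T>> parameter
    // is plumbing used when registering and running destructors at
    // thread exit.
    inner: unsafe fn(Option<&mut Option<T>>) -> *const T,
}
```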

Amos Wenger: Well... Yes. Because... essentially, all this does, like, LocalKey is really just: Here's a function that gives us the address of the thing you're looking for. This is all that it does. But in the best possible case, in the simplest case, the compiler can see through that function — because it sees all your code — and it's like: okay, so I see a lot of indirection. I see like we're calling a function that returns the address of the thing, but we know statically that the thing is there and you're using it, so let's just ignore the functions, just inline everything, and this is just a regular access at this point, and then you can inline some more, and then... so this is what the comment says as well, which I don't love, because I don't love when comments say, "Well, the optimizer is surely going to take care of this," and then, you know, the next point release of Rust is like, "Well, it turns out it didn't," and now everything is faster, or slower, or whatever.

So, now that you perfectly understand how thread-locals work, which is, you know: everything is relative to some base address. The base address changes when you switch threads. And in Rust, really, a LocalKey is just the address of a function that gives you the address of your thread-local.

And that works for all possible scenarios. So now that we know all that, where are thread-locals actually used? What are they useful for? Well, for data that's local to a thread. So asynchronous runtimes, some of them (tokio specifically) do that, because sometimes you want to sleep for a few seconds, and to sleep for a few seconds, you don't block for a few seconds in async code. You tell your executor? I'm looking at James for approval. You tell your executor, no, your reactor!

James Munns: Well, you asked the executor, which asks the reactor.

Amos Wenger: Anyway, so sometimes you have a future, you want to sleep, but you don't want it to block, so what do you do? You find out what the ambient (I like to call it ambient, I'm the only one) current runtime is, which is stored in a thread-local. And then you say: okay, wake me up in, whatever, two seconds, and then you yield. And then in two seconds, hopefully, it polls you again.
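In code, that ambient-runtime dance is just this (using tokio, as in the episode):

```rust
use std::time::Duration;

#[tokio::main] // needs tokio with, e.g., features = ["full"]
async fn main() {
    // sleep() doesn't block the thread: it finds the current runtime
    // (via the CONTEXT thread-local), registers a timer wake-up, and
    // yields; the runtime polls this future again once the timer fires.
    tokio::time::sleep(Duration::from_secs(2)).await;
    println!("two seconds later");
}
```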

Anything with a registry also uses thread-locals. So for example, tracing-subscriber, which is nice because it lets you have separate threads within a program and separate subsystems, essentially. So you can have like threads A, B, and C, and they all have their own subscriber... handler?

James, yes?

James Munns: So the reason that you have a unique one of these per thread, instead of just one static where you just say, okay, the tokio runtime lives at this static, or the tracing subscriber lives at this static: is that to keep the cache locality better? Like, is it better to have one of those per thread versus one for the entire process?

Amos Wenger: That's a good question. I think it's specifically to allow you to have multiple unrelated runtimes with their own thread pools and whatnot. So I don't think it's a performance thing.

James Munns: Needs more research, looks like. Amos has a very confused look on his face, but he's deep in thought.

Amos Wenger: I'm 97% sure. So, in the normal case, I call this the sane case, but in the normal case, you have a single binary, you're happy, you have one crate that depends on 700 other crates because you're doing web things. That's speaking from experience.

And then you have one copy of tokio's code baked into your binary, which includes the CONTEXT thread-local, which stores the current runtime. So at the beginning of your program, you create a runtime, and it sets that thread-local to itself. And then it starts a bunch of threads for its worker pool, and for each of those threads, it also sets that same thread-local to the same runtime. Anything you start doing from that async runtime will happen on a thread that has this thread-local set to a handle to that runtime. That is the sane case, because there's only one copy of your singleton. A singleton is a variable that you're only supposed to ever have one copy of, and it's, it's been deemed evil in the past, by people who don't know about hardware, I guess, I don't know.

And then you have a second normal, completely normal, sane case, which is something not many people know, and I didn't know, I think, two weeks ago, when we first started going down that rabbit hole together on that podcast: there's something called crate-type dylib. Not cdylib, I knew about that one, I've been using Rust to make crimes and load them into regular C programs for a long time now. dylib just means: compile this code as, like, a Rust dynamic library. Do not care about ABI stability. This will break across compiler versions, but emit it as a shared object. So it's going to be a .so on Linux, it's going to be a .dylib on macOS, it's going to be a DLL on Windows, and then have the executable depend on this and load it at runtime. And in this case, if your executable depends on the crate, which is of type dylib, which itself depends on tokio, what cargo would do is pull tokio out into a dynamic library and have both your binary and your dependency, like, all the things, start linking against this dynamic library.

So those are the two normal cases so far. And then you have my case, because... I don't know why I bring this upon myself, but I have been splitting a big project of mine, the thing that runs my website, into separate modules. And I'm not using crate-type dylib. I'm using crate-type cdylib, which means that every module of my website is its own shared object. I have one to compile Markdown to HTML. I have one to render LaTeX to math equations. I have several modules like that, one for CSS, etc.

These are all actually separate Rust projects and I built them separately and they have nothing to do with each other.
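For reference, the difference between the two cases is a one-line crate-type choice in Cargo.toml (the module name here is made up):

```toml
[lib]
name = "markdown_module"  # hypothetical module name

# Rust-to-Rust dynamic library: no stable ABI across compiler versions,
# but Cargo can also split shared dependencies (like tokio) out into
# their own shared object.
# crate-type = ["dylib"]

# What Amos uses: a C-style shared object, built as a fully separate
# project and loaded at runtime. Each module statically embeds its own
# copy of tokio, including the CONTEXT thread-local.
crate-type = ["cdylib"]
```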

James Munns: But, okay. If you're still Rust talking to Rust, why are you going for cdylib instead of dylib?

Amos Wenger: That is an excellent question. So, if I was doing dylib, I would still need to have all the code of everything in one place checked out at the same time. So it's like one big repository, which I do, but that's, that's one thing. The second thing is cargo would parse everything. It would keep everything in memory.

So would rust-analyzer. It would take gigabytes of memory to do that, which it used to. And then whenever you change the tiniest thing, everything recompiles. So for example, all my modules and my main binary depend on one crate, which has the basic set of types and traits that actually define the interfaces between the modules and binary.

And whenever I change that, I don't always need to recompile all the modules, right? Maybe I'm changing the effects for another module or something. But if it was part of the same crate graph, cargo would definitely go, "Oh, okay. The one dependency everyone has in common changed, time to rebuild the entire universe."

James Munns: Right, right, right, because you're okay with saying, " I still promise that I'm only going to use one compiler." But you're getting rid of the promise of, "I promise I'm going to do all of this compiling at one time."

Amos Wenger: Exactly.

James Munns: Makes sense.

Amos Wenger: In fact, the way I built it, none of the modules are built by the time you first run the debug binary; it finds that out, shells out to cargo, builds everything dynamically, and copies it to the right place. And does weird linker stuff. I, I'm weird, okay. I like linkers. You, you learn about something and then you use it.

James Munns: That sounds like a whole new episode and I am excited for that episode.

Amos Wenger: But the problem with my case, which is not normal, or whatever, is that now you have N copies of tokio code. You have one in the main executable, the binary, you have one in every module, which has been built independently by cargo. And you have not only N copies of tokio's code, which would be fine, it's wasteful, but like, whatever, it's only a few hundred kilobytes, it's on a server, I don't care.

But you also have N copies of tokio's CONTEXT thread-local, which is much, much worse!

James Munns: Yeah, your singleton is now a multiverse.

Amos Wenger: Yes, depending on which version of the tokio code is running. So of course the first problem- which I haven't even put in the slides- is that you have to make sure that all the same features of tokio are enabled. Otherwise the layout of the internal data structures of tokio is going to differ because some fields are only present if some cargo features are enabled.

But assuming you get that right (in my case, the solution was just to enable all possible features for all possible modules, even if we don't use them, just to make sure that the binary layout is the same), you still have the problem that you can start a tokio runtime from the binary and then load a module and then invoke an asynchronous function from that module.

And then if you call tokio::runtime::Handle::current() from the main binary, it's going to be: yeah, we have a runtime. And then from the modules it's going to be: no, we don't have a runtime. Because they're not checking the same copy of the CONTEXT thread-local, because those are different slots, because there's N copies of it, which should never happen.

So what do you do? Well, you don't have to rely on the current runtime. You don't have to rely on the thread-local. Pretty much everything tokio lets you do, it lets you do on a specific context or executor or runtime or whatever. You have a top-level tokio::spawn function, which does use the current runtime, but you can also have a handle and then call spawn on that handle specifically, and then it doesn't rely on the thread-local. The problem, of course, is that you're one of the very few people who now cares about this. And all the crates are like: no, it's fine. We can just use the ambient runtime. So good luck trying to patch the entire world.
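A small sketch of the explicit-handle version (real tokio APIs, trivial task for illustration):

```rust
use tokio::runtime::Runtime; // tokio with features = ["full"]

fn main() {
    let rt = Runtime::new().unwrap();

    // tokio::spawn would consult the ambient CONTEXT thread-local and
    // panic here, outside any runtime context. Spawning on an explicit
    // handle involves no thread-local at all:
    let handle = rt.handle().clone();
    let task = handle.spawn(async { 6 * 7 });

    assert_eq!(rt.block_on(task).unwrap(), 42);
}
```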

So if you can't pass an explicit executor, then you could just, because you control the boundary between the binary and the module, you just say: before switching over to the module, let's set their thread-local to the same value as our thread-local, which is a thing you can do.

And then every time the future they returned is polled, let's also restore that to our thread-local. So you're manually synchronizing thread-local values, which works until one of your module's futures spawns a task, that spawns a task, and then you've lost the chain. And that second, like, I don't know, grandchild task does not actually run on the, on the right runtime. So this is not an actual solution either.
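tokio's sanctioned way to set that thread-local for a scope is Handle::enter(); a sketch of the boundary trick and why it isn't enough (module_future stands in for whatever the module handed back):

```rust
use std::future::Future;
use tokio::runtime::Handle;

fn call_into_module<F>(handle: &Handle, module_future: F)
where
    F: Future<Output = ()> + Send + 'static,
{
    // Sets the CONTEXT thread-local for this scope, restoring the old
    // value when the guard drops...
    let _guard = handle.enter();

    // ...but only *this* object's copy of CONTEXT. The module's own
    // copy lives in a different TLS slot, so tasks spawned by the
    // module's tasks can still come up empty: the chain breaks.
    tokio::spawn(module_future);
}
```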

So, as much as it pains me to say this, because I was really trying to get this all to work without patching tokio, but the solution is, of course, to patch tokio. When I complained about this online, on Twitter, or Mastodon, whatever, someone tokio-adjacent said, "Oh, we should just really have a feature that you can enable, and it makes this magically work across shared object boundaries," which is exactly what I want.

But this is a bit more complicated than it first appears. Luckily, for every thread-local in tokio, they don't use the standard library thread_local! macro, which is what you would usually do, and then you would end up with a LocalKey, as we've seen before. They have their own tokio::thread_local! macro, because they have a thing called loom that allows them to debug async code, essentially. I don't actually know how it works; I just know they have their own thing for tests only. So in testing, it uses the loom version of thread-locals, and in production, it uses the standard library version of thread-locals. So this is great, because we get to just redefine that macro to do whatever we want.

And in this case, I've redefined that macro to: instead of initializing a LocalKey with something that returns the address of an actual thread-local, don't reserve a thread-local slot at all, if the feature is enabled, and just have that function return the value of a static mut. This is why I was asking about static muts-

James Munns: Ah...

Amos Wenger: Which is first set up when the module is loaded. So only the binary has the thread-local, and then you load the module and you call the function that says: okay, set your LocalKey function getter to this address, which is a function I export. And then that way, when it's trying to use the thread-local, it's not actually using a thread-local from its own object, it's just calling a function that you control that returns the address of the thread-local from the binary. This is very, very complicated to explain.
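A heavily simplified sketch of that first version (all names hypothetical, nothing here is tokio's actual code): the module keeps no TLS slot of its own, just a function pointer the host installs at load time.

```rust
pub struct Context; // stand-in for tokio's runtime context

// In the module (cdylib): no thread-local slot at all, just a getter.
static mut GET_CONTEXT: Option<extern "C" fn() -> *const Context> = None;

/// Called once by the host binary right after loading the module,
/// handing over the address of *its* getter, so every "thread-local"
/// access in the module really reads the host's thread-local.
#[no_mangle]
pub extern "C" fn set_context_getter(f: extern "C" fn() -> *const Context) {
    unsafe { GET_CONTEXT = Some(f) }
}

pub fn context() -> *const Context {
    // Copy the fn pointer out, then call through it.
    let getter = unsafe { GET_CONTEXT };
    getter.expect("module not initialized: call set_context_getter first")()
}
```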

James Munns: So you're making it sort of like an extern definition. So like in the actual library, instead of defining the static or the thread-local yourself, you're defining an extern. And then when you load it dynamically, because if we were static compiling and we did an extern, the linker would be the one that says, "Ah, this exists in another object file," but at the end I go, "Okay, this is this one."

But when you're doing it, you leave it essentially externally defined when you're finished making the dynamic library, and then it actually gets resolved at dynamic linking time, so like when you're loading the library. So in the root binary, the actual application at the bottom, you have to make sure that it's the only one that ever defines this symbol and then all 50 of your modules or whatever you're bringing in, they all have extern definitions, but as soon as you load them, they get mapped onto your... that's super cool. I've- I do a lot with static linking because in embedded, again, we don't do dynamic linking very often, so dynamic linking is magic for me, but that's something that makes sense to me from static extern to like dynamic extern, which I didn't know you could do.

Amos Wenger: So this is almost it. The initial idea is the thing I just described, which is: okay, you have a static mut somewhere and you have to call some initialization function in the beginning to give it the address it's supposed to look at. But then I figured, yeah, well, if you forget to call it, or you call it twice or whatever, it's not really good. So what I did end up doing is what you see in the slides, which you described, which is: you refer to a symbol that is not defined in this library, and it is going to be present at load time. So when the library is loaded, that symbol is going to exist, and then it's just going to call that symbol.
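And the refined version, sketched with made-up symbol names: the module declares a function it doesn't define, and the dynamic loader binds it against the host at load time, so there's nothing to forget to call.

```rust
struct Context; // stand-in for tokio's runtime context

extern "C" {
    // Declared here, defined nowhere in this cdylib. The dynamic loader
    // resolves it against the host binary when the module is loaded.
    // The hypothetical host-side definition would be a #[no_mangle]
    // function returning the address of the host's thread-local.
    fn host_context_addr() -> *const Context;
}

fn context() -> *const Context {
    unsafe { host_context_addr() }
}
```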

So there's no initialization to miss. You don't have to call anything. If the dynamic loading happens properly, then it's okay. It's fine. But there's still several problems with that. First of all, you cannot, apparently (or I didn't find a way to do it, I tried a bunch of different things), export dynamic symbols from a binary in Rust.

I'm pretty sure it's possible, because we've made up that distinction between binaries and shared objects. They're both just objects, and the binary does export a bunch of symbols. It has an entry point. It has a bunch of different symbols that are exported for reasons I don't quite understand. So you could totally do it, but I couldn't get rustc and the linker to cooperate and make that happen.

So I had to involve another crate called tls-slots that only exports that symbol, and that makes the dynamic linker happy. And then the other problem is that leaving a symbol undefined makes the linker unhappy, even if you're creating a shared object. So you have to pass a specific linker flag saying, "Well, you're not going to find some symbols, just look them up at runtime," which is a very dangerous, very global, like, nuclear solution, because it could be that half the symbols you need are missing and you only find out at load time.
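The flag in question, as a sketch of the macOS side (the exact spelling differs per platform and linker; this is ld64's "leave it to the dynamic loader" knob, passed through rustc):

```toml
# .cargo/config.toml in the module crate (macOS example; a sketch)
[target.aarch64-apple-darwin]
rustflags = [
    # Tell ld64 not to error on unresolved symbols and instead look
    # them up at load time -- the "nuclear" option described above.
    "-C", "link-args=-Wl,-undefined,dynamic_lookup",
]
```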

So hopefully it's just the one. I'm sure there's a much cleaner way to do this. It's just what I could hack last night. It was like 10 PM. I was like, I need to finish my slides!

James Munns: If it's stupid and it works, it's not stupid.

Amos Wenger: So of course, now you want to know, does it actually work? Can you like get several shared objects to cooperate and all use the same tokio runtime?

Not really. Still. Because thread-locals are just half of the problem. You also have process-locals. And I used to think it didn't matter, and actually if you're using the current_thread runtime, the simpler of the two, it does not matter and it does actually work. But if you use a multi-thread runtime, it has a bunch of atomic statics that it updates, and some of those are like number of parked threads, or like number of things that are waiting for that.

And I think it's checking them at various places in the multi-threaded runtime, going, "Oh, it's at zero. We don't need to do anything." So actually sometimes my code gets stuck, and it helps if, from the main binary, before calling into the module, I spawn a task that's just busy-looping. Not really: it, like, sleeps for 10 milliseconds in a loop, and then that has the runtime check on all the tasks again every 10 milliseconds, and that helps the program actually make progress. So no, the solution does not actually fully work right now, but it's, you know, it's one step closer. I just have to take care of process-locals now, which are really globals.

James Munns: Yeah, I was going to say, a somewhat common thing (so again, this is static-linking brain): static linking has something usually called weak symbols. A weak symbol means: I can provide it, but if someone else provides it, you, like, defer to them. But I have no idea how weak symbol resolution for dynamic libraries would work, but it sounds sort of like what you want-

Amos Wenger: That is, that is what I tried first. And I discovered that not only is it, of course, perma-unstable in Rust, but also there's like three different variants of it, which map directly to what LLVM does. And half of it is broken. And even in the standard library, they're misusing it. I can give you links, but the discussions are like: none of that makes any sense. Do not use weak linkage in Rust. That's a bad idea.

James Munns: If you'd like to know a fun trick: the linker can weaken symbols itself. So the compiler can produce non-weak symbols and you can come back with the linker and re-mark the sections as weak. Although, I think the reason that it's unstable is because if the optimizer came in and just, like, pretended that static isn't there, or cached a value because it thinks it never changes, then I could see it miscompiling or causing undefined behavior. But when you start getting into binutils and linkers, you can start doing extra spooky stuff, because linkers are like... oh, man.

Amos Wenger: Yeah, I know a bunch of Linux-specific tricks, but I also use macOS to develop this, so it has to work on both platforms. So some things I just didn't even try. One thing I wanted to try that seemed universal is: just go ahead and patch the code. Just patch whatever function gets the thread-local. The problem, of course, is inlining, because you're patching a function that might not even get called from the call sites; it's all inlined. Specifically, if the thing is all const, I was like, "I'm just going to override the memory." And it's like: no, because nobody's reading from that. We inlined everything a long time ago.

James Munns: If you make it a pub static, so if you export the static, then the optimizer can realize that it's an exported symbol and that it can't mess with it that much. So, what if you replace it with, like, a tokio::static! instead?

Amos Wenger: That's what the macro does when you enable the external TLS feature. Yeah, yeah, but it was a lot of trial and error, because at first I was like: okay, so, trying to override a const. It was like: no. And then exporting a static, but a static function? Exporting a function and overriding it was like: again, no, you can't do that. And then, okay, it was static-

James Munns: Spooky.

Amos Wenger: Mod, pub, whatever, export all the things, #[no_mangle], pretty please with a cherry on top? And then finally it started working. So yeah, future work: get it to actually work, and then apply the same treatment to process-locals, or actual statics, and see if everything works. I want to do the same technique for tracing-subscriber, because right now I have to do that same manual thread-local synchronization, which is really annoying and really error-prone. And I'd like to get this into tokio, but that's another cursed scenario for them to worry about and a bunch of code, and it seems unlikely right now.

I might try, because it's not fun to maintain a patch set against, you know, the most popular Rust executor, but, uh, yeah, maybe, I don't know. I'm going to keep trying to make this cursed thing work, because it's really, really nice for me to be able to build, uh, package up, and deploy my website in under two minutes, all around the world.

This used to take 15 minutes easy, and it's just- you start thinking about whole different things when you have that kind of iteration speed. So I'm going to keep trying because I'm stubborn.

James Munns: I'm super excited, because this is one of those things where you figure out the cursed way, and then you show the internet the cursed things that you do, and you find someone on the internet who goes, "No, no, like this," and you find the better way to do that, and then you figure out how to wrap that in a tool, and then all of a sudden: Oh yeah, that thing we said that Rust could not do for a very, very long time, of having dynamically loadable modules and stuff like that? Oh, now it's fixed. Like, I love that kind of tooling, even when it starts as like, spooky, unsafe stuff, but if we can figure out a way to, to ship that. That'd be super, super cool.

Amos Wenger: That's it.

James Munns: That's a podcast!

Amanda Majorowicz: Doop, doop doop doop, doo.

Episode Sponsor

This episode is sponsored by the Ladybird browser.

Today, every major web browser is funded or powered by Google's advertising empire. Choice is good, but your only choice is Google. The Ladybird browser wants to do something about this. Ladybird is a brand-new browser and web engine, written from scratch and free of the influences of Big Tech. Driven by a web-standards-first approach, Ladybird aims to render the modern web with good performance, stability, and security.

From its humble beginnings as an HTML viewer for the SerenityOS hobby operating system project, Ladybird has since grown into a cross-platform browser supporting Linux, macOS, and other Unix-like systems.

In July, Ladybird launched a non-profit to support development and announced a first Alpha for early adopters targeting 2026, but you can support the project on GitHub or via donations today.

Visit ladybird.org for more information and to join the mailing list.