An introduction to how the 'plumbing' of async tasks and drivers wait, sleep, and are notified for efficient cooperative action; and a glowing overview of the maitake-sync crate's main primitives
Video
Audio
M4A
Show Notes
Episode Sponsor: The postcard crate, more info here
- maitake-sync
- WaitCell
- AtomicWaker in the futures crate
- Waker, RawWaker in the std library
- MPSC (multi-producer single-consumer)
- WaitQueue
- FIFO (first in, first out), VecDeque
- Doubly linked list
- intrusive doubly linked lists
- cordyceps the crate
- cordyceps the fungus
- #[repr(C)]
- Hazard Pointers, QSBR or "Quiescent State-Based Reclamation"
- "This is kind of a bummer..." WaitQueue Implementation Notes
- Dmitry Vyukov's blog 1024cores.net, specifically Intrusive MPSC node-based queue
- WaitMap
- Putpocketing and pickpocketing
- Postcard-RPC
- MnemOS
- Miri and loom
- Pin
- Without boats, dreams dry up blog post: Pin and part 2: Pinned places
- Amos’ blog post Pin and suffering
Transcript
James Munns: So for this presentation, I've done the funniest thing possible for an audio podcast and added a visual gag. We'll see how many people end up coming and looking at the video version of this one. But this is uh, what are you syncing about? So I generally just want to talk about one of my favorite async libraries, and I am biased because I've worked on it and done some contributions to it, but it's one of the few libraries that every single time I have to use it or modify it, I've just come away... like, I don't have a better word than chuffed. Like, it's just, it works the way that I want it to work. It's designed in the way that I would want it to be designed. So I wanted to just share my excitement about it.
And that library is called maitake-sync. So it comes from the mycelium project. And this is from Eliza from tokio and tracing fame.
She was working on an operating system project for a while called mycelium. It has sort of a mushroom theme to all the names and things like that. We then kind of joined forces on another operating system project, but this comes from her original work in this area. So, maitake is actually the executor of that project. Like tokio or embassy, it is an executor. It is aimed, interestingly, sort of between those two. It's very influenced by tokio, but it's written in a way that works on bare metal or inside of a kernel kind of operation, where you may or may not have heap allocation, or it may not be the standard library's heap allocation interface if you're running the allocator in the operating system.
But it had such useful pieces that eventually she broke maitake-sync out to just have the sync primitives exposed for anyone to use. So this is published on crates.io. But this is sort of the synchronization primitives piece of maitake. So if you use tokio, it has a module called tokio::sync.
And those are all of the synchronization primitives that tokio brings off the shelf, or embassy- the executor also has broken out a crate called embassy-sync. This is just: hey, these are sort of like the first party synchronization primitives, or things that you can use in async as building blocks. They may not be tied to that executor, it's just where they came from.
And when we're working in async code, I describe async code as being all about waiting. You are waiting for a packet to come in. You are waiting for an event to happen. Your task is pending because it's waiting for some kind of event to happen. So the concept of waiting is huge, but that connection of how a task goes from waiting to ready to run is all about being notified.
So I like to think of sync primitives as being all about notifications. They're the thing that when an event happens, wakes up your task and says, "Hey, that thing you were waiting on? It's time to go."
These are useful for building essentially anything async. It could be tied to hardware like an ethernet packet. So your reactor, whatever, might want to wake up a task when an ethernet frame is received, but it could also just be totally async data structures, like a mutex, for example. If you're holding a locked async mutex, and another task comes along and wants to lock it... it can't.
It needs to wait until the other task is done. So it needs some way of saying: first, I would like to be notified, and to give that to the mutex and say, "Next time you got a spot available, wake me up." And then when that spot becomes available, it wakes up that task, so that task has a chance to come back and get access to that mutex.
Now, with how Rust async works, it's not actually passing anything along in that notification. There's not usually some, like, associated data. It's just kind of like, when you say I want to wait for something... Have you ever been to a restaurant and they have those beepers? Like, when you order food or your table's not ready yet, and they hand you a beeper, and then you go sit in the waiting room, and then once your table is ready or your food is ready, the little beeper buzzes. And they haven't given you your food through the buzzer.
You just know that it's time to go up to the desk or the counter and check. So you've got no information other than, "It's time to check again." That's how I usually model or think about Wakers, or when I'm teaching them, that's the model I like to use, because it's just a notification. Just 'time to come check again.'
So in particular, maitake-sync has three synchronization primitives that I love. The three of them are called WaitCell, WaitQueue and WaitMap. They're all 'wait' because they're about waiting or being notified for something, and then the back half sort of explains what makes them special.
So WaitCell's the most basic one.
It's a synchronization primitive that holds 0 or 1 Wakers. If you've used, like, the futures crate or tokio, if you've ever heard of something like an AtomicWaker before, WaitCell is very much like an AtomicWaker. Internally we can kind of think of it like a Mutex<Option<Waker>>.
When a task comes along, it knows that it needs to wait. The mutex isn't available, it needs to wait. So it goes, "Hey, WaitCell: here is a Waker to me, my async task, and here's my Waker. You hold onto this. Whenever that mutex is ready, wake me through that handle." It's sort of like that receiving of the beeper, of like, you kind of check in, you go, "Here's my waker, beep me whenever the thing is ready."
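A minimal sketch of that mental model, using a std Mutex<Option<Waker>> (the real WaitCell in maitake-sync is lock-free and considerably more careful than this):

```rust
use std::sync::Mutex;
use std::task::{Context, Waker};

/// A toy stand-in for WaitCell: storage for at most one Waker.
struct ToyWaitCell {
    waker: Mutex<Option<Waker>>,
}

impl ToyWaitCell {
    fn new() -> Self {
        Self { waker: Mutex::new(None) }
    }

    /// Called from a future's poll(): "here's my beeper, buzz me later."
    fn register(&self, cx: &Context<'_>) {
        *self.waker.lock().unwrap() = Some(cx.waker().clone());
    }

    /// Called by whoever produces the event: buzz the beeper, if anyone left one.
    fn wake(&self) {
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
    }
}
```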
Amos Wenger: What's in the Waker specifically? Because I, I have whiplash from reading tokio internals and Rust async machinery in general, and... is it just a vtable with a bunch of raw pointers again?
James Munns: So the way it typically works is it has a pointer to the task that is waiting. And by waking, what you're doing is you're taking that handle of that task and adding it to the executor's run queue. So you're actually taking this task that's floating out in the ether somewhere, just waiting and sleeping.
And you take that pointer and give it to the executor and say, "Hey, this task is ready to run." So I'm sure there's some complication of how different executors do this, but more or less, the Waker is primarily a pointer to the task itself, or a handle of the task, which is then given to the executor to say, "This one's ready," or it adds it to the list of things that are ready.
Amos Wenger: Right, so, while you were explaining, I did a bit of research and actually the Waker type is from the standard library and it does, it's like a wrapper over RawWaker and it is pretty much data plus virtual table.
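For reference, this is the "data pointer plus vtable" shape in the standard library. The example below is a do-nothing Waker; a real executor's vtable functions would find the task behind the data pointer and push it onto the run queue:

```rust
use std::ptr;
use std::task::{RawWaker, RawWakerVTable, Waker};

// A Waker is a data pointer plus a vtable of four functions:
// clone, wake, wake_by_ref, and drop. This one does nothing when woken.
unsafe fn noop_clone(data: *const ()) -> RawWaker {
    RawWaker::new(data, &NOOP_VTABLE)
}
unsafe fn noop(_data: *const ()) {}

static NOOP_VTABLE: RawWakerVTable =
    RawWakerVTable::new(noop_clone, noop, noop, noop);

fn noop_waker() -> Waker {
    // SAFETY: every vtable function ignores the data pointer, so null is fine here.
    unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &NOOP_VTABLE)) }
}
```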
James Munns: This is very simple to think about. We have storage for zero or one items. It's an option. We store the data in the Waker. So like the mutex will just have an option, or a mutex option inside of the mutex itself. So wherever the mutex is allocated, the space for that Waker is. And so when you want to wait on it, you give it the thing.
It kind of puts it in its pocket. And then when the data is available again, it reaches in its pocket. If it has something there, it presses the wake button and then throws it away. Cause it says, "I have done my job. I have notified, we can move on." And it's very simple to think about. It's used in a lot of places.
But it has a problem that it doesn't work with more than one task. So let's say we have that mutex and one of them has it locked. A second one comes along and says, "I would like access to that mutex!" We put it inside the WaitCell, and it goes off. Great! And then a third task comes along and says, "I would like to wait for that mutex!"
And so now what does the mutex do? The only reasonable thing it can generally do is reach into its pocket, if it has something there, it just presses the wake button and then throws it away. And then grabs the new Waker and puts it in the pocket. Because when that second task comes up to the front desk and says, "Hey, is access to that mutex ready now?"
It'll go, "No..." But that's kind of the only thing that it can do without losing one of the two. It just says, "I sent them a notification, but they're not going to be happy when they show up." The real problem with this is you get something that's called Waker churn. Where these two tasks will basically endlessly fight each other until that mutex becomes available, because the second one gets kicked out and then it shows back up at the front desk and says, "Hey, can I have my mutex now?"
And it goes, "No." "Okay, here's my waker, wake me when you're done!" Reaches in the pocket, throws away the other one, and these two just take turns burning CPU until eventually that first task is done with the mutex. And one of them will win. And this is sort of the other problem: it's a random throw of the dice of which one of them wins, because they're just both sitting there essentially polling the mutex in async.
They will check whether the mutex is available, not, yield, and then come back. So they're essentially taking turns running, yielding, running, yielding, running, yielding, which means at least the runtime can make forward progress because these are all yielding, but if the executor is not doing anything else, these two will just sit there and burn CPU cycles until that mutex is available again.
Amos Wenger: Yeah. It's a busy loop, because they're, you know, they're not sleeping for anything. They're not, there's no actual event system notification stuff going on. They're just trying as best as they can.
James Munns: They- they, do yield at least. So it's not all the way busy, but it's essentially like... yeah, like, an endless loop of yield now, basically.
Amos Wenger: Almost busy loop. I see.
James Munns: Which is the second worst thing you can do. It is not the worst thing you can do-
Amos Wenger: True true, true.
James Munns: Which would be like a blocking busy loop, but it is pretty much the second worst thing you can do. So, it's really useful for something if you have like a multi-producer single-consumer channel where there's only going to ever be one thing waiting on that. So on the single-consumer side, you're only going to ever have one task waiting for that data because it has exclusive ownership for it. So it works really great for that, but you can't really use it in any case- like you couldn't even use it on the multiple-producer side because it will just cause problems.
So, it's great because it's simple, but it's simple, which means it's limited.
Now, WaitQueue is sort of the next step up. It's what the name sounds like. It is a queue of waiters. It's capable of holding 0 to infinity Wakers, basically. If you were doing the multiple-producer single-consumer thing or a mutex, you'd be able to just keep tacking on new Wakers.
You'd be able to access them in a first in first out basis. So when someone frees up the mutex, you can go: okay, next in line, I notify them, then throw away their waker. And then when they're done, I can notify the next in line. It's totally fair because it's first in first out... All those kinds of good properties that you want. But before I said that we stored the Waker inside the mutex. It was a single element option. So it either was there or it wasn't.
When you have potentially infinite Wakers, where do you store these? If you're on the standard library or on the desktop or something, you could have a Mutex around a VecDeque of Wakers, or you could do something else clever with heap allocations.
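On std, that heap-backed version might look something like this sketch: a locked VecDeque of Wakers, woken in FIFO order. This is just the mental model, not how maitake-sync does it:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::task::{Context, Waker};

/// A heap-backed sketch of a wait queue: FIFO and unbounded, but it needs an allocator.
struct ToyWaitQueue {
    waiters: Mutex<VecDeque<Waker>>,
}

impl ToyWaitQueue {
    fn new() -> Self {
        Self { waiters: Mutex::new(VecDeque::new()) }
    }

    /// A future's poll() joins the back of the line.
    fn register(&self, cx: &Context<'_>) {
        self.waiters.lock().unwrap().push_back(cx.waker().clone());
    }

    /// Wake whoever has been waiting the longest.
    fn wake_one(&self) {
        if let Some(waker) = self.waiters.lock().unwrap().pop_front() {
            waker.wake();
        }
    }
}
```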
But I said that maitake and maitake-sync work on bare metal systems with no allocator. So how could we possibly store an unbounded number of Wakers?
And the answer is doubly linked lists. And because this comes from Eliza, you know it's got doubly linked lists in it. And more specifically and more spookily, it is intrusive doubly linked lists.
And this is the next mushroom name. So the intrusive list library from mycelium is called cordyceps. If you aren't familiar with cordyceps: this is the fungus that invades ants' brains and takes over them and makes them crawl up so they can be eaten and spread the disease. But it is an intrusive data structures library. And what I mean by intrusive data structures is: when you have a regular doubly linked list, it would have sort of nodes for each of the items, and each of the nodes would point to the data. So you would have your string or whatever in the doubly linked list. So you would have node, node, node, node, and each node would point to its data. Intrusive kind of smashes that together and says that each of these nodes contains the pointers and the actual string itself. You reduce the number of pointer indirections by one, which is fairly convenient for some reasons.
And this is particularly interesting to us because when you poll a Future- and we're talking in async here- so when you poll a Future, you actually get two things that are relevant here. You get a pinned mutable reference to Self and you get the Context. The Context is where we're going to get the Waker from. This comes from the executor. It says, "Here is where you get your Waker from when you want to be woken," like that's where you get the specific pointer and vtable that you'll need for the executor to do that waking. But that pinned pointer to Self is particularly interesting because we know that when the Future has been pinned, as long as it is not Unpin, then it cannot move for the existence of the rest of that Future.
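Those two inputs are visible right in the standard library's Future trait, for reference:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

// From the standard library (std::future::Future): poll receives a pinned
// mutable reference to the future itself, plus a Context carrying the Waker.
pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
```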
We know that that Future is going to be at this pointer location for the remainder of time. Which means we could stick something inside of the task or the Future itself that has these two pointers- like point backwards point forwards- and the actual data and wire it up. So somewhat cleverly, every task brings its own storage to the list, which means as long as you were able to create the Future or create the task, then the room for that node in the linked list lives inside of the task itself. Now, the reason that this needs to be a doubly linked list instead of a singly linked list is because Futures can be dropped.
And if a Future is cancelled, it needs to be dropped, then you need to remove yourself from that doubly linked list, which means you have to grab the next one and the previous one, remove yourself from the equation and wire them up.
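A heavily simplified sketch of the shape this takes. The names here are made up, and all of the locking and unsafe bookkeeping that the real cordyceps/maitake-sync code does is omitted; it only shows the key idea that the node (links plus Waker) lives inside the future that's waiting:

```rust
use core::ptr::NonNull;
use core::task::Waker;

// One link in the intrusive doubly linked list. Because it's doubly linked,
// a cancelled (dropped) future can splice itself out: it knows both its
// neighbours and can wire `prev` directly to `next`.
struct Node {
    prev: Option<NonNull<Node>>,
    next: Option<NonNull<Node>>,
    waker: Option<Waker>,
}

// The future returned by waiting on the queue *contains* its own node. Once
// the future is pinned (and is !Unpin), its address is stable, so the queue
// can hold raw pointers into it until the future completes or is dropped.
struct Wait<'q> {
    queue: &'q ToyQueue, // hypothetical queue type, standing in for WaitQueue
    node: Node,
}

struct ToyQueue {
    // Head/tail pointers into the chain of Nodes would live here, behind
    // whatever locking or atomics the real implementation uses.
}
```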
And this is great. We can abuse this stable pointer. And if we do some fancy things with like #[repr(C)] and things like that, we can make sure that the pointer that we have is always reasonable to be including in this doubly linked list.
Which is wonderful, because on embedded, we don't have to like, have a const generic argument of like, "Okay, how many Waker spaces do we have?" Where if you don't give enough, then you get into that Waker churn problem, just it's- instead of zero or one, it's like: if you have spaces for four and a fifth one shows up, then you just Waker churn the whole chain, which is somehow even worse.
Or if you over-allocate it, you're wasting a bunch of static memory space that you could otherwise use. And you have to carry around this annoying const N: usize bound on everything that uses it.
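For contrast, the fixed-capacity alternative being described would look roughly like this hypothetical sketch (not a real maitake-sync type), where the N ends up infecting every signature that touches it:

```rust
use core::task::Waker;

// A fixed-capacity waiter list: if all N slots are full, the next waiter has
// nowhere to go and you're back to churning wakers or failing outright.
struct BoundedWaiters<const N: usize> {
    slots: [Option<Waker>; N],
}

// ...and everything that embeds it has to carry the bound along:
struct BoundedMutex<T, const N: usize> {
    data: T,
    waiters: BoundedWaiters<N>,
}
```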
But the downside is it's a doubly linked list. So you get kind of all the downside of linked lists of: less pointer locality, access is a little bit more expensive, the code is more unsafe and challenging to use. But at least as a mitigation for here on embedded, we don't have like three levels of caches where a pointer lookup might be incredibly expensive because it's a cache miss, it's just part of RAM.
It's just kind of the same as any other access on embedded, which is great. And we typically only access this linearly. For like a WaitQueue, you're always waking the first one, pulling the chain forward, grabbing the next one, pulling the chain forward. Or if you're removing yourself, you already have the two pointers, so you grab them and put them together. You're not really like trying to random access the linked list, which is like the worst access pattern you can do on a linked list. So that's WaitQueue. It's very neat and very flexible, especially on embedded.
Amos Wenger: I'm reading code comments from WaitQueue as we're discussing this. And there's a very interesting comment saying, "This is kind of a bummer-"
James Munns: Oh, what part?
Amos Wenger: "That we have to use a doubly-linked list." Because there are apparently very interesting singly-linked list designs. And there are lock-free doubly-linked lists, so you wouldn't have to have the mutex around it. But they rely on, "deferred reclamation schemes, such as hazard pointers or QSBR," both of which I'm hearing about for the first time, so I'm going to shut up. But it's an interesting comment.
James Munns: Have you ever heard of Vyukov? V Y U K O V?
Amos Wenger: I, I, yes, because I just read the comment, but otherwise, no.
James Munns: Okay, he's a, he's a guy who's been doing high performance data structures and linked lists and particularly lockfree algorithms. So, for a very long time, maybe still, but a ton of high performance crates you've seen that use pointers and things like that or use linked lists: a lot of them are just using, you'll see mentions of like Vyukov's algorithm and he has a bunch of different ones, but he's had this blog, I think it's like 1024cores.something he's written sort of the standard of so many of these. And actually mycelium uses one of those for its run queue. So when you have a singly linked list, and you want it to be lockfree, you can use this singly linked list, which means that you don't need a lock to append a task to the back of a queue.
Because if you're always appending something to the back and always pulling from the front, you can actually get away with no mutex on that just using atomics. It's a little tricky and you have to be very careful, but there is like a well-defined and tested atomic algorithm that is likely to work with that.
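Very roughly, the producer side of that trick looks like the sketch below: a single atomic swap appends a node to the back. This only shows the append step; the single-consumer pop side needs a stub node and some care, so see Vyukov's 1024cores write-up for the real algorithm:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: AtomicPtr<Node>,
    // ...the task / payload lives here in an intrusive design
}

struct ProducerSide {
    // Points at the most recently pushed node. In Vyukov's design this is
    // initialized with a permanently-allocated stub node, so it is never null.
    back: AtomicPtr<Node>,
}

impl ProducerSide {
    /// Append `node` to the back. Safe to call from many producers at once.
    ///
    /// SAFETY (caller): `node` must stay alive, and at a stable address,
    /// until the consumer has popped it.
    unsafe fn push(&self, node: *mut Node) {
        (*node).next.store(ptr::null_mut(), Ordering::Relaxed);
        // Claim the "back" slot with one atomic swap...
        let prev = self.back.swap(node, Ordering::AcqRel);
        // ...then link the old back to us. Between these two steps the chain
        // briefly has a gap, which the consumer side has to tolerate.
        (*prev).next.store(node, Ordering::Release);
    }
}
```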
But the problem is when you need random access to the middle, you break that property, unfortunately, because if you had two threads... let's say, removing themselves at the exact same time, or two tasks running on two different threads removing themselves, they could race, and essentially half the chain just floats off into space because they've done it incorrectly and detached the chain.
Amos Wenger: Yeah, I see. I see what you mean.
James Munns: And I think the hazard pointer stuff is more like... kind of ref counted garbage collecty thing where you use heap allocations to do it in a way- I don't- I'm not entirely sure. But I think they're not generally applicable to a non-allocated environment.
James Munns: So then there's the third one: again, I'm a little biased because I wrote this one. But it's just the oddball weird one of the whole group. So like WaitCell and WaitQueue are essentially the building kit for building any async thing. If you are building a channel or a mutex or anything basically, even all the way down to hardware drivers, WaitCell and WaitQueue are essentially the two primitives that you need that you can turn anything into an async version of itself.
Because once you have the thing that can do something, and some way to notify and wait for things, all of a sudden you now have an async version of that driver or library structure or anything like that. WaitMap is my fun oddball one. It's not really as fundamental as the other two, but it works quite a bit like WaitQueue in that it uses an intrusive doubly linked list- but instead of just storing the Waker in the task itself, it also stores a Key and a Value. So we have some Key of something that we're waiting for, and the Value is actually space to store the Value when that Value becomes ready. So before I described it as: the Waker doesn't really send any data. You just get a notification to go check the desk that your item might be ready.
But that requires you to be woken up and then go check and see if your thing is there. Maybe say, like, "I'm waiting for message number whatever," and the data structure will say, "Yes, I have this message for you." It's- it's kind of extra work. WaitMap says: what if we did that all in one step?
So let's say we had something like an async mailbox.
We're sending requests out that maybe have some specific sequence number with them. And we're waiting for responses to come back. So you say like: sequence number four goes out, sequence number five goes out, six, seven, eight, but they're not necessarily going to come back in the same order. They could come back out of order, but you always know the response will have the same sequence number as the request. So kind of like the WaitQueue, you tack yourself onto this doubly linked list. But we also know that next to the Waker, there is the Key, which is essentially what you are waiting for. And then there's space for the Value. So when you start, the Value is not there because you're waiting for it. And then what happens is when that item actually arrives- so the other side, like the mailbox is processing responses and it goes, "Ah, I got response number four. Let me look through my list. Ooh, here's someone who is waiting with the key of four. I have this response. I just shove the answer in their pocket. Then poke the waker and then throw them out of the list." So essentially it's kind of like a- have you ever heard the phrase 'putpocketing' before?
Amos Wenger: I have not.
James Munns: It's the opposite of pickpocketing. So pickpocketing, you sneak up behind someone, you take something out of their pocket. Putpocketing is you sneak up to someone and put something into their pocket.
So essentially, while the task is sitting there sleeping and waiting, the mailbox can go, "Ooh, you were looking for this." It puts the item in your pocket, boops you on the head so that you know that you're ready to wake up, and then removes you from the list of things that are waiting. It is perfectly designed for a bare metal async mailbox, which when I'm writing something like Postcard-RPC- where I have exactly this request response RPC pattern- it's exactly what I need. And this is why I wrote it.
And it allows us to be very, very flexible, because otherwise we'd have to have all of this storage. Like, let's say we receive a couple of packets in, and then we've woken these tasks to tell them to come pick up their messages... but what if they get delayed and they don't come and pick it up quickly and I run out of space? If I don't have a heap, I'm bounded in the amount of space that I have. So what am I supposed to do with these messages? Do I just throw them away? And when they show up, then I say, "Sorry, I threw away your response. It got here, but I threw it away because you took too long..."
So this allows us to basically say: as long as you can make a request, you have space for the response because everyone's bringing their own storage for it. Which works really well on embedded because we don't have to worry about perfectly sizing the bounds. We don't have to worry about what happens if someone takes too long, and even you can essentially cancel caring about the response because if the Future removes itself, then all of a sudden when that response comes in we'll go, "Is anyone waiting for this one? Nope." Okay, we just throw away the packet or we NAK the packet or whatever we need to. Essentially, this allows us to pretend like we have a bunch of oneshot channels, but without having to allocate space for those oneshot channels.
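On std, the same key/value rendezvous can be sketched with a locked HashMap, where each entry is either "someone is waiting (here's their Waker)" or "the value arrived". This heap-backed sketch loses the bring-your-own-storage property that makes the real WaitMap work without an allocator, and only shows the delivery side:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::mem;
use std::sync::Mutex;
use std::task::Waker;

/// One entry per in-flight sequence number.
enum Slot<V> {
    Waiting(Waker), // a task parked itself here; wake it when the value lands
    Ready(V),       // the value was "putpocketed" before the task checked back
}

struct ToyWaitMap<K, V> {
    slots: Mutex<HashMap<K, Slot<V>>>,
}

impl<K: Hash + Eq, V> ToyWaitMap<K, V> {
    /// Mailbox side: a response with `key` arrived. If nobody registered
    /// interest, hand the value back so the caller can drop or NAK it.
    ///
    /// (The waiting side would be a Future whose poll() inserts
    /// Slot::Waiting(waker) and completes once it finds Slot::Ready.)
    fn deliver(&self, key: &K, value: V) -> Result<(), V> {
        let mut slots = self.slots.lock().unwrap();
        match slots.get_mut(key) {
            Some(slot) => {
                // Put the value in their pocket, then boop them on the head.
                if let Slot::Waiting(waker) = mem::replace(slot, Slot::Ready(value)) {
                    waker.wake();
                }
                Ok(())
            }
            None => Err(value),
        }
    }
}
```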
Amos Wenger: It's interesting that this property that you like in no_std
land that like essentially all the state gets folded down into this state machine that you carry on the stack, unless you explicitly put it on the heap, which you don't have one of in no_std
land... is the bane a negative thing?
Yes. It's the bane of web developers. Because they end up having enormous Futures and they're like, "Oops, we put like six megabytes on the stack just by calling this thing that just does one HTTP request." And then you have to box things on purpose just so the stack doesn't blow up. And there's no good tooling to determine like this async block, like how big, how big is it going to be once compiled?
And of course it depends between debug and release. So it's interesting that this very, very annoying property, which is fundamental to how Rust async works. This time, this one is, is, you know, both annoying for high-level web code and really essential for low-level bare metal code.
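The "box it on purpose" workaround looks something like this (do_big_request is a made-up placeholder for a deeply nested async call):

```rust
// Without the Box::pin, the entire state machine of `do_big_request` would be
// embedded inline in the caller's future; with it, only a pointer is.
async fn handler() {
    let response = Box::pin(do_big_request()).await;
    println!("{} bytes", response.len());
}

// Stand-in for some async call that compiles to a large state machine.
async fn do_big_request() -> Vec<u8> {
    vec![0u8; 1024]
}
```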
James Munns: Yeah. And it's like you said, it's one of those things of scale that when you're waiting on one kind of thing... in moderation, it's fine, and it's reasonable and on embedded, you probably have moderation because you don't have much to go around anyway, so you're not gonna be boxing hundred kilobyte chunks of data or megabytes of data. You're like, "Well, I have my one space of 128 bytes that I'm gonna put my response in, and I'm excited about it!"
Like, yeah, on embedded a lot of times we either statically allocate our tasks, so we go: look, there's only gonna ever be these five tasks existing, and we have them in statics, basically. So they are statically allocated. Or, like in MnemOS, my operating system, we do have an allocator, but we'll sort of just allocate all the tasks up front most of the time when we're creating drivers. And then they just live unbounded more or less. It's less dynamic usually than you would have on your desktop.
But yeah, it's very flexible and we don't need a bunch of oneshot channels.
But now the downside is again that it's a doubly linked list, but actually sometimes a little bit worse because this isn't a real hash map. We are linearly searching the doubly linked list for which waiter is ready.
Amos Wenger: Right.
James Munns: Which for small Ns, which you probably will have on embedded is fine. Especially when the cost of following pointers isn't too expensive. On the desktop, you might want to not use a linked list for all the reasons that you might not want to use a linked list on desktop. But on desktop, you probably have a heap, and you probably could just use a real hash map of all of these items rather than a linear search.
You could use the same intrusive trick for storage, but also on a desktop, you maybe just don't. You could just use a oneshot channel, and it wouldn't be a problem. So, this totally can work on desktop, but if you get to the point where you have, like, 2,000 in-flight requests at the same time, you're going to have horrific performance, because you might have to search the entire 2,000 waiting tasks to figure out: oh, no one was waiting for this response, and I just wasted a whole lot of time and cache busting walking this doubly linked list.
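The "just use oneshot channels" version on a desktop runtime might look roughly like the sketch below, using tokio's oneshot plus a HashMap of pending senders keyed by sequence number. This is an illustration of the pattern, not Postcard-RPC's actual implementation:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::oneshot;

struct Pending {
    // seq number -> the sender half that completes that request's future
    waiting: Mutex<HashMap<u32, oneshot::Sender<Vec<u8>>>>,
}

impl Pending {
    /// Caller side: register interest in `seq` and get something to await.
    fn register(&self, seq: u32) -> oneshot::Receiver<Vec<u8>> {
        let (tx, rx) = oneshot::channel();
        self.waiting.lock().unwrap().insert(seq, tx);
        rx
    }

    /// Mailbox side: deliver a response to whoever registered for `seq`.
    fn deliver(&self, seq: u32, body: Vec<u8>) {
        if let Some(tx) = self.waiting.lock().unwrap().remove(&seq) {
            let _ = tx.send(body); // receiver may have been dropped (cancelled)
        }
        // If nobody was waiting, the body is simply dropped here.
    }
}
```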
Amos Wenger: Well, what if you have more tasks than is practical for a linear search, but still don't have a heap? Clearly the solution is to move from a doubly-linked list to some sort of tree data structure that is self-balancing, don't you think?
James Munns: I think cordyceps either has or is working on an intrusive tree as well. And I've thought about it, but it makes my brain hurt. I'm very lucky that Eliza wrote most of this code, especially cordyceps. And she is proficient in both Miri and loom. So Miri is sort of a runtime sanitizer that checks if you've done undefined behavior, basically.
Amos Wenger: It is- it is worth noting that I wasn't serious. I was just trying to be annoying. Like, "Can you imagine?" And then the response being like, "Actually, Eliza's probably working on it." Is, is par for the course. It's what I would expect.
James Munns: We should have Eliza on as a guest just to explain cordyceps or intrusive linked list because they are tremendously useful and scary at the same time.
Amos Wenger: Sure!
James Munns: But yeah, this is one of those things that, like I said, in moderation it's good. It gives you the kind of flexibility that feels unreasonable, but with a little knowledge of the Dark Arcane and how Pin works and how Futures work, you realize that all of a sudden, with a little bit of a stable pointer and #[repr(C)], you can start doing the very spooky things that you'd like to be doing and get the kind of safe flexibility you'd like out of it.
Amos Wenger: Speaking of Pin, I know this is, this is your outro and I'm ruining it, but Boats has a great article about it.
James Munns: It's really good and there's going to be a second part that I am also very excited about.
Amos Wenger: Yes. And I'm going to link to it.
James Munns: Yes, please do. And this is actually- I caught myself, because I was going to say, if something's pinned, it can't move. But that blog post explains exactly the point where that's not exactly true: Pin only has that property if the item is not Unpin, so it cannot be un-pinned before the end of its life cycle. But yes, that is a very, very good post.
And Pin is one of those things that is very load-bearing in the async world and also tremendously not well understood, particularly by its critics, and uh... it's a very neat thing, but it is a weird sort of programming language theory brain topic, which is unusual for a lot of Rust.
Amos Wenger: Yeah, I, I have tried to explain Pin, in 2021. And... I wonder what will happen first: whether the majority of Rust users or consumers will understand Pin, or whether we find a way to boot it out of the language. I'm, I'm genuinely curious, because Boats announced an idea. So I'm excited about the idea, as always.
Episode Sponsor
This episode is sponsored by the postcard crate. Postcard is gearing up for version 2.0 of the library and is looking for sponsors for this release. If your company uses postcard, you can help fund a portion of these efforts.
Postcard is a compact binary wire format that is compatible with serde, and allows devices big and small to quickly and efficiently serialize their data. Postcard is used widely in embedded Rust projects, as well as for projects like ICU4X, bevy, and wasmtime.
The 2.0 release is aimed at simplifying the library API while not changing the wire format, meaning postcard 1.0 and 2.0 are fully compatible when it comes to messages you send and receive. This version will also stabilize the Schema and MaxSize attributes, which are used by other crates in the postcard ecosystem like postcard-rpc.
For more information about sponsoring postcard 2.0, check out this blog post or send an email to contact@onevariable.com to get in touch.