BBQueue - Going Just Far Enough with Generics


A dive into abstracting over unusual behavioral differences, using generics to manage different storage and async usage styles

View the presentation

Audio


Download as M4A

Show Notes

Episode Sponsor: OneVariable

Transcript

James: My actual point for this week is to talk about the complicated things that I'm doing in BBQueue and to explain just enough of a background of BBQueue to make it understandable why these options are useful.

Amos: Sounds good. I'm listening.

James: Cool. So this is just going far enough with generics in BBQueue, which is probably too far, but I haven't thought of a better way, so I'm gonna call it 'just far enough.'

So, I have this library, BBQueue, which is a fancy single producer, single consumer-ish ring buffer.

So, you could use it as more than a single producer, single consumer, but it's one of those things where when you put data in, the first one to take it out is the one that gets it. So, sort of like Rust's channels, which are mostly MPSC, or multiple producer, single consumer. That's just so you don't have the case where one consumer is stealing all the messages from the other consumers, as opposed to like, a broadcast channel from Tokio.

But, BBQueue is a weird ring buffer. It's actually more like something called a Bip-Buffer, or a bipartite buffer. And the weird thing about it is that pushing and popping data is two-stage. So instead of just saying, "Here's an item, push. Here's an item, push. Pop an item, pop an item," like a normal ring buffer you might expect, it's actually two stages here.

The first one is to get a grant. So this is either on the pushing side or the producing side. You say, "I want a chunk to be able to put eight bytes in." So instead of putting eight things in one at a time, you go, "Give me room for eight."

And it goes, "Okay, here's your room for eight." Or it says, "I have no room for eight." Or on the consuming side, "Give me everything that you have available." Instead of just pop, pop, pop, pop.

The second stage of this is actually committing that grant or releasing the grant. So you might say, "Give me room for eight." I actually put six items in and then I say, "Okay, I only used six. Here are the six that I'm actually giving you." Or in a release, I might say, "There are 700 bytes available," and you go, "Cool. I took 140." And it goes, "Okay." And so this two stage thing is sort of like peeking into the buffer, either peeking into a write space or peeking into a read space.

But this is sort of the two stage process. You get a view of the internals of the ring buffer that you can either write or read from. And then you say, "I'm done with that. And I did read, or I did write this much data."
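
(For concreteness, here's roughly what that two-stage flow looks like using the published bbqueue 0.5 API; the newer design discussed in this episode may differ in detail:)

```rust
use bbqueue::BBBuffer;

// A 16-byte queue, statically allocated.
static BB: BBBuffer<16> = BBBuffer::new();

fn main() {
    let (mut prod, mut cons) = BB.try_split().unwrap();

    // Stage 1, producer side: ask for room for 8 bytes...
    let mut wgr = prod.grant_exact(8).unwrap();
    wgr[..6].copy_from_slice(b"abcdef");
    // ...stage 2: we only used 6, so commit just those.
    wgr.commit(6);

    // Stage 1, consumer side: "give me everything you have"...
    let rgr = cons.read().unwrap();
    assert_eq!(&rgr[..], b"abcdef");
    // ...stage 2: report how much we actually consumed.
    rgr.release(4);
}
```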

So why is it weird like this? It's weird like this for two reasons. One, if you are putting a ton of data, like 700 bytes into a queue, if you had to call push or pop 700 times, you're spending more time in the, like, overhead of coordination than you are actually pushing and popping data.

So if you know that you're going to be doing big chunks at a time, like a whole HTTP frame or a big line of console data or something like that, you might want to do that one chunk at a time instead of paying that overhead on every step. The other thing is that these grants are actually a stable view of the storage of the ring buffer.

You are actually getting a mutable slice into the buffer: let's say the ring buffer is 4 kilobytes large, you'd actually be getting, like, a 512-byte view in place, inside the ring buffer. Which means that if you were trying to zero-copy deserialize or something from that, you're not copying the data out into a vector or something and then deserializing from that.

The data was put into the ring buffer and then you are viewing the data where it was put in the ring buffer. On embedded systems too, this is awesome for DMA, or direct memory access. You say, "Hey, hardware peripheral, put 512 bytes of serial data right here." To do that, you need to give it a pointer and a length, basically, and then promise that that memory is writable by this hardware peripheral until it says, "I'm done, you can have it now."

Which means that actually when you're receiving serial data or ethernet data, you can have the hardware put it exactly into the ring buffer without having to receive it into a buffer and then copy it into the ring buffer to make it available somewhere else. So I've said before that I solve a lot of problems with allocators.

This is a way of turning a ring buffer into sort of a self contained FIFO allocator.

Amos: That's what I was gonna say, it's not a ring buffer anymore, it's a full blown allocator.

James: Yeah, well, it's not full-blown, because it is still first in, first out. A really cool thing about allocators is you could hold on to a message, process some other stuff, and then later release that message. BBQueue says: no. It has to go in this order and out this order. You can choose when you say "I've processed this data," but it has to go in and out in that order all of the time. Because it's a ring. There's no way to free data in the middle of a ring.

Amos: Right, there's never any fragmentation.

James: Exactly. Yeah. And that's the benefit, is you never have fragmentation, because you can get a big chunk of like, "Give me 512 bytes," but then if you say, "I only used 128," that essentially frees the tail of the allocation, and that's what you're going to use next time.

Which means there really is no fragmentation, but the trade-off is you have to process it first in, first out. So, this is a really useful data structure, and it's useful in a lot of different places to a lot of different people in a lot of different setups. The problem in the past is, I wanted to support all of those, and I didn't know how to do that without making however many copies of the full library that sort of work the same, and have the same vibes, but are tuned for different use cases.

Until, very recently. So I have a bunch of different ways that this can be used. And I'm going to go through each of them and why they're complicated and how I fixed it with generics.

So one is you might want either inline storage or heap storage. If you're on an embedded system and you're statically allocating that 4k buffer or whatever, it's very easy to just say, "I want a static right here with 4k of storage there."

And the type system can know: oh, this is generic over const N: usize for the buffer size, and I can make a fancy buffer that's 4k.

Amos: Similar tricks are used for smart strings, like small string crates in Rust.

James: Yeah. But at runtime, if you go, "Well, I'm going to read a config file at startup, and I'm going to allocate a buffer based on that size," you can't tell the compiler, "Hey, I'm not going to know the actual type of this until runtime." So, you might want to put it in a heap, where you say, "Give me a boxed slice," where I can say: box slice new with, oh, this size that I read from the config file, which is 768 bytes.

Cool. I can do that. Now it seems like you might always want that, but the downside is you're forcing one layer of indirection. If the data is inline, I go to the queue and the bytes are right at the end of that queue structure; versus with a boxed slice, I have to take a pointer to my queue, then bounce one more time to my storage. Which isn't super expensive, but you'd prefer not to pay for it if you don't have to. And on embedded systems where you don't have a heap, it might not even be an option to do that.
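
(As a sketch of that difference, with hypothetical type names rather than the library's real ones:)

```rust
// Inline: the bytes live directly inside the struct, with a
// compile-time size, so the whole thing can be a static on embedded.
struct InlineStorage<const N: usize> {
    buf: [u8; N],
}

// Heap: the struct only holds a pointer to a buffer whose size was
// chosen at runtime, e.g. from a config file. One extra indirection.
struct HeapStorage {
    buf: Box<[u8]>,
}

fn main() {
    let _fixed = InlineStorage::<4096> { buf: [0; 4096] };
    let size_from_config = 768; // pretend this was read from a file
    let _sized = HeapStorage {
        buf: vec![0u8; size_from_config].into_boxed_slice(),
    };
}
```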

Amos: And so that's your first layer of generics...

James: Yep. So this is the first configurable and we're going to play a fun game of binary combinatorics, but-

Amos: How many combinations, yes?

James: Yeah. So since this is an allocator, I have this trait, Storage, where instead of giving you a slice, it gives you back a pointer (a NonNull<u8>) and a len (a usize). And this is because when we get that storage buffer, we have to be really careful. Because there might be one thread that can see one chunk of the data, and another thread that can see a different chunk of the data. So we can't hand out a slice here, because then we'd have aliasing slices. So we have to act like an allocator and be really careful and go, "Here's the pointer and length. Do not make a slice of that entire length unless you know that it's safe."

Amos: Because if you did make a slice, like if you did have aliasing slices, that would be instant undefined behavior?

James: Yep, because if the producer has a view of 512 bytes of the 4k, and you made another slice of the whole 4k: boom, undefined behavior. Because you have aliased slices- or specifically aliased mutable slices to the same underlying buffer.

So this is my Storage trait, and then I have sort of an extension super-trait, ConstStorage, for when you can make the storage at const time, like if you were defining a static. If it's just an inline buffer, you can define that as a static, but if it's a heap allocation, you cannot make it static. So this is how I say: it doesn't matter how you store it, on the heap, as a static, inline, even as a reference if you wanted. Here's the trait you have to fill. You have to promise me that this buffer never changes, that this is the pointer to it, and this is the length of it, because I'm going to use that as my backing storage.
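
(A sketch of that shape; the names are paraphrased from the episode, not copied from the actual source:)

```rust
use core::ptr::NonNull;

/// SAFETY: implementors promise that the buffer never moves or changes
/// size while the queue is alive.
unsafe trait Storage {
    /// Pointer to the start of the backing buffer. Deliberately not a
    /// slice: the queue carves out non-overlapping views itself.
    fn ptr(&self) -> NonNull<u8>;
    /// Length of the backing buffer in bytes.
    fn len(&self) -> usize;
}

/// Extension for storage that can be constructed in a const context,
/// so the whole queue can live in a `static`. An inline array
/// qualifies; a heap allocation can implement `Storage` but not this.
unsafe trait ConstStorage: Storage + Sized {
    const INIT: Self;
}
```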

Amos: Right. So yeah, when you said inline, like heap versus inline, I was thinking of inline as the stack, but I didn't realize it could also be a static, as you say, which would then be, specifically a reserved part of the executable image that's then mapped into memory.

James: Yeah, it's a static global variable essentially. You could have a global variable that is your queue, and let's say the header of the queue is 40 bytes or something like that. If you made the storage inline, you'd be literally putting the 4 kilobytes of buffer right next to that header. Like, it is inline: they are both contained within the same struct. Versus in the non-inline one, you have the 40 bytes of header, and then the storage that's sitting next to it is actually a pointer to somewhere else instead. So it's actually generic over whether the data is stored inside of the same struct, or whether the struct holds a box or something of data somewhere else.

Amos: The term "intrusive data structures" comes to mind, but I'm not even sure it applies here, to be honest.

James: No, intrusive data structures are more linked-list-y. This is, like I said, generic over whether the data is stored in the queue, or whether the queue holds a pointer to somewhere else.

Amos: Mmm..

James: So that's the first degree of freedom. The second option is async or not, because not everyone wants to use async. In embedded you may not use it, or if you're just using it for something blocking, or you have threads or something like that, and you're coordinating between them, you may or may not want async.

And async doesn't necessarily add a lot of overhead, and I'll show you how it might. But... it's a choice. If you don't use async, you don't necessarily want to pay any overhead for that, but if you do, you really want that and it's useful.

Amos: You mentioned an async memory allocator previously to me. Is that related? Would you want the BBQueue to be async because your memory allocator is async? Or what other asynchronous things would BBQueue even do?

James: Yeah. BBQueue out of the box: when it's not async, you can say, "Try give me a grant." Or, "Give me a grant and panic." Like, it's not like a channel in Rust where you can make the operating system block until you're ready.

Amos: Right.

James: You're saying like, "Give me a grant." Basically the only API I provide is: try give me a grant, and it'll either say, "Nope, no data available," or it'll say "Yes, there was data available, here you go."

Amos: It keeps surprising me that you use async for asynchronous things instead of, like everyone else, just using it for, like, timers and input/output. Like, no: instead of blocking or failing, you can just hand out a future that will eventually resolve. It's natural, but I'm always jump-scared by it. I'm like, "Oh yeah, of course this is asynchronous, if you think about it."

James: But it's like a Tokio channel, for example: you could do try_send or try_recv, and for a bounded queue, there either is room or there isn't. Or if you have async, you can call .send().await or .recv().await, and that way it will wait until there either is data in the queue, or the queue has enough room to accept what you're sending.

So sort of like the async allocator, you would want to wait for the same reason: is there capacity or not? But this is, uh, yeah, so I guess it is more similar than I thought.

Amos: Yeah, but... I do want to point out that all the try_send, try_recv methods are... pretty much just sugar over the regular async versions plus .now_or_never(), which is my favorite future combinator. It's like: poll the future once; if it returns Ready, return Some, otherwise return None.

James: It's funny, because mine's the opposite. My async ones are sugar over the 'try' ones. Let me just show you what the trait looks like. So I have a normal trait called Notifier, which is just wake_one_consumer, or wake_one_producer. So if a producer puts data in the queue, it wakes one consumer. And if the consumer takes data out, it might want to wake one producer, because now there might be room available for it.

But, I have, sort of, again, one of these extension traits that says: well, I have this async notifier trait, which requires that the type also be a Notifier, but there are two main things that we're waiting on here.

Either, I need a future that tells me when the queue is not empty, or I need a future that tells me when the queue is not full. Cause if I want to put data in, I need to wait until the queue is not full. And if I want to take data out, I need to wait until the queue is not empty. So there's four futures I have here, but this is because when it comes to like atomic wakers, there's actually a two step process.

You need to put your waker into place to be woken by the reactor, then check. So you want to say, like: okay, notify me if there's a time where the queue is not empty. Then you check if it's empty. If it is, you yield and wait to be woken, because you've already registered your waker. If it has what you want, you just immediately return.

So it's essentially a loop where it's like: install waker, check, await. Install waker, check, await. So mine's actually the opposite of what you described, where the native is a try, but by having this optional async notifier, we can have a method that's only available when you have an async notifier implementer that says, "Ah! But now I can also wait for that to happen."
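
(The code on screen isn't in the transcript, so here's a rough reconstruction of the shape being described; the names and exact signatures are guesses:)

```rust
use core::future::Future;

pub trait Notifier {
    /// A producer pushed data: wake one waiting consumer.
    fn wake_one_consumer(&self);
    /// A consumer freed space: wake one waiting producer.
    fn wake_one_producer(&self);
}

/// The async extension. Associated future types, rather than `async fn`
/// in the trait, let an embedded impl return !Send futures while a
/// Tokio-based impl can bound its futures with Send.
pub trait AsyncNotifier: Notifier {
    type NotEmptyRegisterFut: Future<Output = ()>;
    type NotEmptyWaitFut: Future<Output = ()>;
    type NotFullRegisterFut: Future<Output = ()>;
    type NotFullWaitFut: Future<Output = ()>;

    /// Step 1: install our waker so a wakeup can't be missed.
    fn register_not_empty(&self) -> Self::NotEmptyRegisterFut;
    /// Step 2: yield until the queue is not empty.
    fn wait_not_empty(&self) -> Self::NotEmptyWaitFut;
    /// The same pair, for the producer side.
    fn register_not_full(&self) -> Self::NotFullRegisterFut;
    fn wait_not_full(&self) -> Self::NotFullWaitFut;
}
```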

Amos: It is also worth noting that I was wrong. Just look at the try_recv methods for a UDP socket, for example: they're not going through the reactor at all. And now I actually know the difference: they're just calling mio methods directly. I knew it wasn't possible, because to even poll the future once, you need some sort of runtime where you can pass in a waker.

You can have a no-op waker, but... not a good idea if you're like- otherwise using it in an async context. Anyway, my whole remark was wrong, but it did prompt you to explain more about how your system works. So, still a win.

James: Yeah. Yeah... and the reason I have this two-step process is because we have two threads that are coordinating. If you check whether it's empty and then install the waker, that's a race condition: you could check, then data becomes available, then you install your waker, which means you're too late for the notification, and you have to go around again. Which is why I have this two-step process: install the waker, then check, then go around if not.
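
(In sketch form, reusing the hypothetical AsyncNotifier trait above; the only point here is the ordering:)

```rust
/// Consumer-side wait. The waker is installed *before* the check, so a
/// wakeup that races with the check can't be lost. `try_read` stands in
/// for the queue's non-blocking read-grant call.
async fn wait_until_not_empty<N, T>(
    notifier: &N,
    mut try_read: impl FnMut() -> Option<T>,
) -> T
where
    N: AsyncNotifier,
{
    loop {
        notifier.register_not_empty().await; // 1. install the waker
        if let Some(grant) = try_read() {    // 2. then check
            return grant;                    // data was already there
        }
        notifier.wait_not_empty().await;     // 3. yield until woken, retry
    }
}
```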

Amos: That's my favorite class of bug, which is called TOCTOU. Time-of-check to time-of-use.

James: Exactly. And the reason that I have these futures for those two steps, for each of wait_not_empty and wait_not_full, the reason I make them associated types, is that on embedded systems we don't necessarily care if our futures are Send, because we don't have work-stealing executors like Tokio. So I can put a future here that is not Send, and use it just fine.

And then if I'm using something Tokio-based as the notifier, I can force that type to be Send, which means it just sort of shakes out. So instead of using the regular async function in a trait, I have this sort of complicated set of four associated types: a not-empty register future, a not-empty waiter future, a not-full register future, and a not-full waiter future. Because they're associated types, you can put different bounds on them, whether you're using Embassy's synchronization primitives or Tokio's synchronization primitives.

Amos: Right.

James: So that's the second variable. So far we have where the storage of the buffer is, and whether we're async or not. The third one is: do we have atomics or not? Because I work on a lot of embedded systems that don't necessarily have what are called atomic compare-and-swap operations. Every CPU can usually load and store to some memory address within the width of its architecture in a single cycle. So on a 32-bit processor, you can store 32 bits or load 32 bits as one operation, atomically. But when we do things like fetch_add, or swap, or compare-and-swap, those usually require some level of hardware support. On microcontrollers, these are usually what are called LL/SC instructions, load-link/store-conditional: I can load and store atomically, and the store only succeeds if nothing else touched that address in between. And if I have those primitives, then LLVM can write me a function for fetch_add or swap or something like that. On Cortex-M, there are specifically the LDREX and STREX instructions. On x86, you have different instructions, like a locked compare-and-swap, which can do these operations directly.

There's a couple of different ways processors do them, but essentially you can do things like swap a variable or fetch_add an integer without worrying about being preempted. You can do it as if it were one operation. There might be some spinning to do that, but it behaves as if it's transactional.

Amos: Right. Yep.

James: BBQueue is an algorithm that I wrote with some other folks for doing all of this; it's what's called a lock-free data structure. There's no mutex provided by the system or operating system to do this. All the coordination on who has room to write or read is done with atomics. Which means that you don't need a blocking mutex or anything like that that might interfere with it. You can do it totally uncoordinated, or only coordinating through these atomics.

But there's a lot of cheap microcontrollers that you might want to use, like the Raspberry Pi RP2040, which don't have support for those atomic instructions. Now in the embedded world, we have ways of polyfilling or backfilling these. Essentially, you take what's called a critical section, where you prevent yourself from being preempted, and just do it.

But it doesn't necessarily make sense to emulate a bunch of atomic instructions one by one. You go: well, if I'm going to need a mutex anyway, just take a mutex whenever I'm doing this coordination. And that way, it's only a couple of pieces of work: add some numbers, check if this is bigger than that, if it is, write this data here, then do this. It's not a very complicated algorithm, so you can just do it in a couple of instructions inside of a critical section. This way, you don't have to get these compile-time errors where you go, "Oh, you tried to use the swap instruction, which doesn't exist on this microcontroller."
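
(Illustratively: the same fetch_add done lock-free versus inside a critical section, using the `critical-section` crate. This assumes a critical-section implementation is provided for the target:)

```rust
use core::cell::Cell;
use core::sync::atomic::{AtomicUsize, Ordering};
use critical_section::Mutex;

// Lock-free flavor: needs hardware read-modify-write support.
fn bump_atomic(ctr: &AtomicUsize) -> usize {
    ctr.fetch_add(1, Ordering::AcqRel)
}

// Polyfill flavor: the same effect from plain loads and stores, made
// safe by doing them inside a critical section (no preemption), for
// chips like the RP2040's Cortex-M0+ cores that lack such instructions.
fn bump_critical_section(ctr: &Mutex<Cell<usize>>) -> usize {
    critical_section::with(|cs| {
        let cell = ctr.borrow(cs);
        let old = cell.get();
        cell.set(old + 1);
        old
    })
}
```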

Amos: Yeah.

James: So essentially all of these like primitives of what my queue needs to do, like negotiating between the two sides, like: do I have room to get a write grant? Or: do I have room to get a read grant? Or: I have done some stuff with my write grant, I am making it available to the reader. Or: I have done some stuff with my read grant and I'm recycling that space so it's now available to the writer. I essentially have one function for each of these coordination items.

Amos: Right.

James: Now this trait is unsafe, because it's arbitrating access to a shared buffer of memory. If you implement this trait wrong, you could have undefined behavior, because you could accidentally give out overlapping access. So this trait is unsafe because you have to do this right, or you've busted your allocator's soundness, basically.

Amos: Yep. Yep. Yep.

James: So this is essentially all of the primitive coordination operations that my queue needs. And I can either do them with lock free atomics or I can do them with a more mutex-y critical section, whatever works better for my platform.
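
(A hypothetical shape for that trait, with one method per coordination step; the real trait will differ:)

```rust
/// SAFETY: an implementation must never let the writer's and reader's
/// regions overlap. Getting this wrong hands out aliasing mutable
/// access, which is why the trait is unsafe to implement.
unsafe trait Coord {
    /// Producer: try to reserve `n` bytes; returns the write offset.
    fn grant_write(&self, capacity: usize, n: usize) -> Option<usize>;
    /// Producer: publish `used` bytes of the current write grant.
    fn commit_write(&self, used: usize);
    /// Consumer: try to take the readable region as (offset, len).
    fn grant_read(&self) -> Option<(usize, usize)>;
    /// Consumer: recycle `used` bytes back to the writer.
    fn release_read(&self, used: usize);
}
```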

Amos: Mm hmm. And that's the third dimension of customization?

James: This is the third dimension. Yep. And here's the fourth.

Amos: Right. Here's the fourth. Oh boy.

James: So I talked before about whether the storage- that buffer of four kilobytes that we're putting data into- whether that is heap allocated or not.

But I didn't mention where we're putting the metadata. Because let's say I want to use this on the desktop, and I want to have the traditional, like, split producer and consumer interface. We need to make sure that the metadata, and the buffer that it owns, live until both the producer and consumer have dropped their sides of it.

That data needs to live long enough for that. Now if I'm on embedded, and I place this as a global static, I don't have to worry. That data's static, it lives forever, it's just fine. But what if I'm on the desktop and I say, "I want to make this buffer hand out handles to two threads," and they work for ten seconds or something like that?

So if you hand out that producer to one thread and the consumer to another thread, they live for a while and then you drop them, you want to make sure that data gets recycled at the right time. But like I said, if you are either using a heap allocation and just doing it locally, and you can borrow that heap allocation, that's cool. Or if you make a static and you borrow that to make my producer and consumer, that's cool too.

So this is another layer of abstraction, over how the queue itself is stored for whoever's using it, without making copies of the data structure.

So this trait is the most complicated, because it has to know about the BBQueue type itself, which means it has to be generic over the other three things that we've talked about. So this is saying that we have a trait called BbqHandle, which is generic over the storage, coordination, and notifier. And it gives me something that I can deref to a BBQueue, so I can get a reference to this BBQueue, and that reference is clonable. So if it's an Arc, I can call clone on the Arc. If it's a reference, I can just clone the reference, because a reference is a reference. It means that I can clone that handle.

So what does this get me? I end up with like a producer and a consumer with generics that look like this.

So I have pub struct Producer, generic over Q, S, C, N, where S is Storage, C is Coordination, N is Notifier, and Q is my "where is the queue stored," or "how do I get my BBQueue handle," which is generic over S, C, N. And this producer and consumer just hold on to the queue, whether it's an Arc to it or a reference to it. And I end up with this producer and consumer API that doesn't care how it's implemented; it just knows how to do these things.
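
(Again, the on-screen code isn't in the transcript; here's a reconstruction of roughly that shape, with the details guessed:)

```rust
use core::marker::PhantomData;
use core::ops::Deref;

pub struct BBQueue<S, C, N> {
    storage: S,
    coord: C,
    notifier: N,
}

/// A handle is anything clonable that derefs to the queue:
/// `&'static BBQueue<S, C, N>` and `Arc<BBQueue<S, C, N>>` both qualify.
pub trait BbqHandle<S, C, N>: Deref<Target = BBQueue<S, C, N>> + Clone {}

impl<T, S, C, N> BbqHandle<S, C, N> for T
where
    T: Deref<Target = BBQueue<S, C, N>> + Clone,
{
}

/// The producer just holds on to a handle; the consumer looks the same.
pub struct Producer<Q, S, C, N>
where
    Q: BbqHandle<S, C, N>,
{
    queue: Q,
    _pd: PhantomData<(S, C, N)>,
}
```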

Now, I said that I have four options here, and each of them is basically a binary option. When you have two to the fourth options, it means that I have sixteen versions of this. And because I love puns, usually whenever I talk about different versions in Postcard or whatever, I like calling them flavors.

Like, it's just an easier way for people to think of: I prefer this version of something, in the same way that you have preferences for different flavors. And because it's called BBQueue, I needed a theme for all 16 different flavors, which means that I had to find a list of 16 different global varieties of barbecue.

And I've given them type aliases, because when you go through these combinatorics, you have to remember which one is which.

And there's going to be some gimmes. Like, if you're on desktop with a heap, you're probably going to want a certain one. If you're on embedded with no atomics, you're probably going to want a different one. For example, if you're on embedded with no atomics and no heap, you probably want Jerk, like jerk barbecue.

If you're on desktop and you have a heap allocator, but you want your data to be stored in line, you probably want Kansas City style barbecue. All these kind of things. And I do actually have a division where all the inline storage versions are North and South American styles of barbecue. And all the ones that have external storage are the rest of the world.

So we've got Europe, Asia, Africa; other parts of the world that have barbecue styles and things like that. So... I don't know if everyone will love these combinations, but-
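
(For example, something like this; the barbecue names are from the episode, but the parameter types here are invented stand-ins:)

```rust
// Invented stand-ins for the real parameter types:
struct InlineStorage<const N: usize>([u8; N]);
struct AtomicCoord;          // lock-free coordination
struct CriticalSectionCoord; // polyfilled coordination, no atomics needed
struct Blocking;             // non-async notifier
struct BBQueue<S, C, N>(S, C, N);

// Embedded, no atomics, no heap: inline storage plus critical sections.
type Jerk<const N: usize> =
    BBQueue<InlineStorage<N>, CriticalSectionCoord, Blocking>;

// Desktop with a heap available, but data still stored inline.
type KansasCity<const N: usize> =
    BBQueue<InlineStorage<N>, AtomicCoord, Blocking>;
```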

Amos: Important question: are you documenting those design choices in the Rust docs, or do you just hope people will pick up on them?

James: I've mentioned it. And it's the same thing as a menu: when you have 16 choices, people will be overwhelmed by the choices. I plan to have a Chef's Choice section in there, of: Hey, you're on desktop? You probably want this one. Hey, you're on embedded? You probably want this one. You're async and embedded? You probably want this one.

Cause there's probably like three or four that everyone's going to use, and the rest of them are kind of weird niche combinations. There's probably some reason to use them, but it's going to be pretty clear which one you want to pick in these different cases.

Amos: Do you run unit tests for all 16 combinations?

James: I mean, I'm just barely implementing this at this point. So the answer is no. But it's interesting because the core of it is actually fairly independent. I don't necessarily know if I need 16 unit tests for this. I think you could unit test the different implementations of each of these traits. And as long as they follow the interface guaranteed by the trait, then the queue itself is guaranteed to work.

That's maybe a bold assumption, but it's one of those things where they are independent.

Amos: Sure. Right. Let's call them integration tests, then.

James: Yeah. Yeah. So right now I have some basic like smoke tests that are running with Miri where I do check some combinations of these, basically like the inline ones and the not inline ones.

And then the async ones and the blocking ones. But I think it makes sense to have integration tests for all 16 of these. Right now this was mostly: how far can I push generics to be able to have 16 versions of this library that I don't have to write 16 times, with subtle copy-and-paste differences? Because that's what kept me from doing this for the two or three years since I last touched BBQueue: I knew I wanted to support all these use cases, but I couldn't figure out how to flex the type system in a way that wasn't more complicated than it was worth. And I think I've come up with a pattern where the amount of complication gets you something worth it. And by having named varieties like this, there's a chance that I can hide the generics from people who don't need to worry about them, and just go: I always use the Kansas City version of it. I don't think about those four generic parameters; I just get a Kansas City queue and a Kansas City producer and a Kansas City consumer.

And I might have one macro that makes, you know, queue producer and consumer for all of these varieties. So that you get like real types out of them. But I'm okay with using a macro for that, versus having a macro that gets instantiated for 16 different full implementations of the queue.

Amos: Yeah, yeah. So this isn't on crates.io yet, and I've tried to find the source code; I haven't even been able to. When can people play with this, James?

James: It's on GitHub right now. So in my GitHub, it's BBQ2. So B-B-Q, the letters, 2. So not like "BBQueue", not like the pun, but-

Amos: I see. So would that actually be a separate crate? Cause it's not even a major version bump. It's like a mega version bump, which SemVer does not account for?

James: Yeah, and this is another thing that I'm actually playing around with. I'm running into this same thing with Postcard of: how do you hack on a crate, or maybe iterate on it, where you know that the design is going to need to change fairly fundamentally. You know you are going to make a breaking change, but how do you make the two or three breaking changes up to that breaking change?

Amos: Right, yep.

James: I think within the world of Rust, I'm starting to think that maybe the best way to do that is to make a BBQ2 repo or a Postcard2 repo, iterate on there, make some breaking changes in that repo, and then when the design is fairly clean, merge that into the upstream BBQueue or Postcard repo and release it as a breaking change there. That way, if I need to make six or seven breaking changes while I'm still shaking out this design, it doesn't inhibit the development of the other crate. It does mean I'm going to have to kind of come in one day and just drop almost unrelated code on top of it, but I mean... I can't think of a better way to do that incrementally, because there's no concept where I can have breaking changes, other than maybe those .alpha releases, which don't work super well in Cargo. Like, they work how they're documented to work, but that's not necessarily the way you want them to work.

Amos: I like that you have a separate repository, because sometimes you see projects doing that with branches. And the thing is that all branches except the default one are really obscured in the UI in GitHub. So, sometimes, you have to go to the branches tab and then you see: Oh, this one is 135 commits ahead of the main branch or something, then you backport everything.

James: It's like the "-ng" naming, like if it was BBQueue-ng or something like that, so...

But yeah, I'm still playing around with this. The whole reason that I want this is for a network stack, and I'll get to that at some point in the research, but I realized part of my network stack could really use a queue shaped like BBQueue, because I want to be able to do hardware accelerated sending and receiving.

And I realized that BBQueue wasn't going to cut it in all the places that I wanted to use it, so it became time to come back and look at BBQueue, and it turns out with two or three years of learning how to do Rust better, I figured out how to solve that problem that had been frustrating me for a longer time than I'd like to admit.

Episode Sponsor

OneVariable is a consultancy focused on advising and development services in the areas of systems engineering, embedded systems, and software development in the Rust programming language. Do you need help building something in Rust? Check out onevariable.com/work to see if one of the specialties speaks to your needs.

Get in touch by emailing contact@onevariable.com.