Direct Memory Access for the Uninitiated


An introduction to DMA, including what it is commonly used for, and a model of how to think about what the hardware does

Show Notes

Episode Sponsor: The Embedded Working Group Community Micro Survey

You can take the survey now by clicking here.

Transcript

James Munns: Cool. Alright, so this is direct memory access for the uninitiated. I'm gonna skip over a lot of details, because this is a very deep rabbit hole we can go into.

But I've had a couple people asking me about DMA recently, and uh, the audience can't see it, but looking on the face of Amos right now, I'm guessing he has questions about DMA, so we're gonna get into this.

Amos Wenger: I'm just, I'm chuckling because it feels like this is one of those topics. Like, "But first we have to explain the world," and the very first thing you said being like, "We're going to skip over most of it." It's just amazing. Like, I'm very happy because I'm going to get to see how you navigate this. So please, by all means, go ahead.

James Munns: Yeah, I've tried to pick my battles and we'll see what questions you ask.

Amos Wenger: Let's remember those words. Yes.

James Munns: So you've probably heard the term DMA and DMA stands for direct memory access. This is used pretty much across all of computing, all the way down to the little microcontrollers to your desktop and things like that.

DMA broadly is a thing that's applicable to all of this. But before I explain what it is, I should probably explain memory access. For someone who's only ever programmed something on a desktop or something like that, you kind of go, well, my CPU, I have my RAM, I have my hard drive and stuff like that.

It's easy to not realize that your computer is a whole network of computers basically. And even something as simple as accessing memory is almost more like talking to a server over the internet than... I can't even think of a simpler metaphor than that.

Amos Wenger: Yeah, I was going to say, "What's your metaphor here?"

James Munns: Yeah, I have- I'm left with nothing.

Amos Wenger: The simplest of things is actually more complicated than you think, so good luck.

James Munns: Yeah. So let's say we have a system where we've got two CPU cores and some memory. And I'm not going to talk about caches or anything like that. We're going to pretend those don't exist, cause that's another rabbit hole to go down, but we've got two CPUs and they're connected to the memory through something that's called a memory bus.

So this is the actual network connection more or less between your CPU and your memory. And this is useful because you might have multiple chunks of memory. So you might have multiple sticks and each stick of memory has modules on it and things like that. And if you have two CPUs and they both want to access the same memory at the same time, they can't.

They need to arbitrate that access and figure out: Hey, I want to access this memory. Where is that memory? Okay. It's on this chip over here. I need to go get that. And I need to read or write from it. And if two of them want to do that at the same time, something needs to control: okay, you go first and then you go second.

So this memory bus is about connecting to the memory, but also arbitrating the access to this memory. Now we're still working in the model where like, we're talking about a specific pointer address. Again, I'm not going to get into virtual memory, which is a whole other layer, but let's say we have the actual physical memory address: address 40000 whatever, we can say that maps to this specific part of this chip.

And if both of them want to touch it at the same time, they have to negotiate for that.

Now, memory isn't the only thing that your CPU talks to. Microcontrollers have a ton of little peripherals for talking to serial ports, or timers, or a bunch of useful accessories that you might want, and your desktop CPU is going to have something kind of like this.

It could be for things like serial ports, it might be for talking to sensors like temperature sensors on your motherboard, it could even be for talking to things that are relatively slow like USB. So compared to your CPU and your main memory, USB- especially like older USB 2- is way slower than this. So you might have a separate bus called the peripheral bus, where all those things like serial ports are connected.

Now we usually map these also in memory. So there might be a specific pointer, physical address that this pointer goes to. And when your system talks to memory that is real memory, it goes: Oh, okay, this range of addresses live on the memory bus, and this range of addresses live on the peripheral bus.

So when I'm talking to this memory, I'm talking to my DDR memory on the motherboard, and when I'm talking to this memory, I'm talking to a USB port or a serial port or something like that.
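As a rough illustration of the address-range idea described above, here is a minimal sketch of a toy memory map that decides which bus an access would be routed to. The `Bus` enum, the `route` function, and all four address constants are hypothetical and chosen only for the example. Real chips define their own ranges in their reference manuals.

```rust
/// Which bus a physical address is routed to in this toy memory map.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bus {
    Memory,
    Peripheral,
}

// Hypothetical address ranges, chosen only for illustration.
const RAM_START: u32 = 0x2000_0000;
const RAM_END: u32 = 0x2004_0000; // exclusive
const PERIPH_START: u32 = 0x4000_0000;
const PERIPH_END: u32 = 0x4010_0000; // exclusive

/// Decide which bus would service an access to `addr`.
/// Returns `None` for addresses that map to nothing in this sketch.
pub fn route(addr: u32) -> Option<Bus> {
    if (RAM_START..RAM_END).contains(&addr) {
        Some(Bus::Memory)
    } else if (PERIPH_START..PERIPH_END).contains(&addr) {
        Some(Bus::Peripheral)
    } else {
        None
    }
}
```

Calling `route(0x4000_0000)` lands on the peripheral bus in this map, while `route(0x2000_0010)` lands on the memory bus. In hardware this decision is made by the bus fabric, not by software.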

Amos Wenger: So this is all happening at the hardware level? Like it's not a feature of the kernel, like memory-mapped files? It's actually like the hardware knows this range is for this device.

James Munns: Yeah, this is one of those where there's a lot of layers of abstraction and a lot of systems handle this very differently. For example, on some systems, the peripheral bus might be exposed through the memory bus. Like, it might not be one or the other. You might have to go through one to another. But generally: Yes, at a hardware level, this is exposed as a memory address, and then it's exposed to higher levels of the software. So what your kernel would actually interact with would be some memory address where the serial port lives.

Amos Wenger: Gotcha.

James Munns: Then, once we go to user space and stuff, it's abstracted and abstracted and abstracted, but at the actual hardware level, there's generally a- like a memory address that you can go to to talk to something.

Now, the problem is- well, your CPU is incredibly fast. Your memory is reasonably fast. It's closer to the CPU speed than something else, but peripherals like a serial port, or even USB 2.0, are maybe a million times slower than what your CPU or your main memory can do. Which means when we're talking to them, this can be painful, because just writing a value to memory and reading a value from memory might be very quick, but if you want to write some data over a serial port, that takes a millennium, perceptually, to the CPU. It feels like forever.

Now, a modern desktop, so we're in 2024, a modern desktop, like a MacBook Pro, might have a memory bandwidth- the speed limit between your CPU and the RAM- might be in hundreds of gigabytes. So a high end MacBook Pro will have maybe 400 gigabytes per second bandwidth between the CPU cores and the main memory.

Today's modern microcontrollers like a Raspberry Pi RP2040 or something like that might have hundreds of megabytes of bandwidth. So talking to its memory, which again, works totally different, but it might be able to read or write hundreds of megabytes per second at a reasonable clock speed.

Now, a serial port, like something that you might use to hook up to a very old piece of control equipment or something like that, is measured in baud, or symbols per second. And 115200 baud, which is a very common serial port speed that you might have on your computer, if your computer still has a serial port, is 11.25 kilobytes a second. So we're like five orders of magnitude slower than our microcontroller, and like, ten orders of magnitude slower than our MacBook Pro. So this is a huge mismatch: if I wanted to send one byte over the serial port, this takes as long as it would take me to stream like a whole movie to my main CPU.
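The arithmetic behind these figures can be sketched as follows. The sketch assumes 8N1 framing (one start bit, eight data bits, one stop bit, so ten bits on the wire per byte), and it assumes round values of 200 MB/s for "hundreds of megabytes" and 400 GB/s for the laptop. Those two constants and the function names are illustrative assumptions, not measurements. Under those assumptions the gaps come out at roughly four and roughly seven and a half orders of magnitude, so the "five" and "ten" quoted in conversation are best read as loose spoken estimates.

```rust
/// Bytes per second on a serial line, assuming 8N1 framing
/// (1 start bit + 8 data bits + 1 stop bit = 10 bits per byte).
pub fn serial_bytes_per_sec(baud: u32) -> f64 {
    baud as f64 / 10.0
}

/// How many powers of ten separate a fast rate from a slow one.
pub fn orders_of_magnitude(fast: f64, slow: f64) -> f64 {
    (fast / slow).log10()
}

/// Assumed round figures, loosely based on the numbers in the conversation.
pub const MCU_BYTES_PER_SEC: f64 = 200.0e6; // "hundreds of megabytes"
pub const LAPTOP_BYTES_PER_SEC: f64 = 400.0e9; // "maybe 400 gigabytes per second"
```

For 115200 baud, `serial_bytes_per_sec` gives 11520 bytes per second, which is 11.25 when divided by 1024.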

Amos Wenger: I know I look young, but I am old enough that I have owned peripherals that were connected over parallel port and serial port. And I'm now wondering if like the scanners of the time, for example, were maybe limited by the speed of the connection rather than the actual, I don't know, the speed of scanning.

James Munns: Yeah, yeah for sure. And a lot of serial devices didn't even run this fast. Like 115200 is a reasonably quick serial port speed. Some of them are down at 9600 baud, which is like a kilobyte a second basically. So it can get slower than this for sure.

Amos Wenger: And for a while, internet access speed was directly related to that, I believe? Like the first modems, like 56k, whatever. I haven't had anything slower.

James Munns: Some of them were. We're in this order of magnitude, like old dial up- if you talk about 56k modems, that would be five times faster than this serial port I'm describing. But also like first gen internet might've been at this speed, but we're at the order of magnitude of dial up, if that makes sense.

Amos Wenger: Yeah.

James Munns: So we have this sort of setup where we've got the connection between our memory, like where we have the data we want to send over the serial port, and the peripheral, that has a speed limit of gigabytes per second, or hundreds of gigabytes per second. And then the speed limit between our serial port peripheral, like the little hardware accelerator on our chip that handles serial port stuff, sending that actually over the physical wire, like the electrical signals over the wire, that has a speed limit of 11 kilobytes.

So we have to figure out how do we not make our whole system slow to a stop every single time we want to send some data over the serial port. And also make sure that that data we're sending over the serial port is smooth. So if we send one byte and then go off and do something else, we want to make sure that next byte to go out on the wire is ready so there's no pause in between each byte because that'll slow us down even more. So we want that to be like smooth data transfer as well.

So if we were to write, like- this is all very fake, but not far off from what you would write in a microcontroller, if we were to write a blocking send function- for each byte that we send, we might do a for loop where we say: Okay, check if there's an error with the hardware, because there might be an error with the hardware, and then we wait until the hardware says, "I'm ready to take some more data to put out on the wire."

Which means we're just sitting there in a while loop, waiting, just burning CPU cycles. And then finally when it says, "I'm ready for one byte," we give it one byte, and then we go back to waiting, and we just sit there and wait forever. And like I said, it might be millions or billions of CPU cycles for each byte that goes out over the wire.
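The slide itself is not part of the transcript, so what follows is only a guess at the shape of such a blocking send, written against a software stand-in for the hardware. `FakeUart`, `UartError`, and every method name are invented for this sketch. A real driver would read and write memory-mapped registers instead. The `wasted_polls` counter makes the cost of busy-waiting visible.

```rust
/// Errors a (hypothetical) serial peripheral could report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UartError {
    Overrun,
}

/// Software stand-in for a serial peripheral; all names are invented.
pub struct FakeUart {
    busy_polls_per_byte: u32,
    busy_remaining: u32,
    /// Set this to simulate a hardware fault.
    pub fault: Option<UartError>,
    /// Bytes that have gone "out on the wire".
    pub wire: Vec<u8>,
    /// How many times the CPU polled and found the port busy.
    pub wasted_polls: u64,
}

impl FakeUart {
    pub fn new(busy_polls_per_byte: u32) -> Self {
        Self {
            busy_polls_per_byte,
            busy_remaining: 0,
            fault: None,
            wire: Vec::new(),
            wasted_polls: 0,
        }
    }

    fn check_error(&self) -> Result<(), UartError> {
        match self.fault {
            Some(e) => Err(e),
            None => Ok(()),
        }
    }

    /// Returns true once the port can take another byte.
    fn tx_ready(&mut self) -> bool {
        if self.busy_remaining == 0 {
            true
        } else {
            self.busy_remaining -= 1;
            self.wasted_polls += 1;
            false
        }
    }

    fn write_byte(&mut self, byte: u8) {
        self.wire.push(byte);
        self.busy_remaining = self.busy_polls_per_byte;
    }
}

/// Blocking send: the CPU does nothing useful while the port is busy.
pub fn blocking_send(uart: &mut FakeUart, data: &[u8]) -> Result<(), UartError> {
    for &byte in data {
        uart.check_error()?;
        while !uart.tx_ready() {
            // Busy-wait: burn cycles until the hardware is ready.
            std::hint::spin_loop();
        }
        uart.write_byte(byte);
    }
    Ok(())
}
```

With a port that stays busy for 1000 polls per byte, sending three bytes costs 2000 wasted polls in this model. The real ratio on hardware depends on the clock speed and the baud rate.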

Amos Wenger: This is funny because I'm looking at your pseudocode and it's Rust. And in Rust we have sum types. So we have results with like an OK and error variant. But in hardware, errors are like: this bit somewhere in a register or whatever flips when something is wrong and you have to check it all the time and this is, you're just busy looping instead of doing anything asynchronous. But yeah. This is fun to see in Rust.

James Munns: Yeah, and we do that a lot in embedded Rust is we turn, you know, you might have six different bits. Imagine like error LEDs on your Wi-Fi radio or something like that. We turn those into an enum value, but yeah, exact same kind of thing. But we have to just literally check: is there an error?
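A minimal sketch of that bits-to-enum pattern might look like this. The bit positions, the `SerialError` variants, and the `check_status` function are all assumptions made for illustration. A real peripheral's status register layout comes from its datasheet.

```rust
// Hypothetical bit positions in a serial status register.
const FRAMING_BIT: u8 = 1 << 0;
const PARITY_BIT: u8 = 1 << 1;
const OVERRUN_BIT: u8 = 1 << 2;

/// The error bits, expressed as one variant each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SerialError {
    Framing,
    Parity,
    Overrun,
}

/// Translate raw status bits into a `Result`.
/// If several bits are set, this sketch reports overrun first,
/// then parity, then framing; that ordering is an arbitrary choice.
pub fn check_status(status: u8) -> Result<(), SerialError> {
    if (status & OVERRUN_BIT) != 0 {
        Err(SerialError::Overrun)
    } else if (status & PARITY_BIT) != 0 {
        Err(SerialError::Parity)
    } else if (status & FRAMING_BIT) != 0 {
        Err(SerialError::Framing)
    } else {
        Ok(())
    }
}
```

The benefit is that callers can use `?` and `match` on a typed error instead of masking bits by hand at every call site.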

Now, this is awful because like I said, we're burning millions or billions of CPU cycles potentially just waiting, doing nothing, which is incredibly wasteful. So this is where DMA comes in.

DMA is for babysitting memory copies. It lets us delegate that responsibility of: here are 600 bytes I would like to send over a serial port.

And your CPU just goes: here are the 600 bytes. DMA, please send this to the serial port and let me know when you're done. And the way that these actually look from a hardware level, and again this is one of those things where if you look at 10 chips they might implement DMA 10 different ways, but conceptually you can think of them like a very, very, very simple CPU core who can only do one thing.

And that's copy memory from one place to another. So it's not like your desktop CPU, where, you know, you have a whole instruction set, X86 or ARM or whatever that can do tons of different things. DMA is like a CPU core who's babysitting memory copies. So, the way this like delegation works is your CPU will have some data in memory- so like that 600 bytes of serial data you want to send. And you'll have a pointer and a length- basically a slice. So your hardware conceptually will say, "Starting address is here, and then the next 600 bytes are what I want to send over the serial port." So you hand those two pieces of information to DMA, maybe configure it somehow, and then you say, "Go."

And it goes, "Okay, I'll let you know when I'm done." And it goes off and does it. And then at some point later, you get a notification or an event or something that says, "DMA transfer complete!" So you get sort of a, on a microcontroller, it might be an interrupt. On a more desktop piece of hardware, it might be some kind of event or something like that, but you'll get a notification or an event sometime later that says, "This happened."

So you went off and were doing something else for a billion CPU cycles. And it said, "I'm done, thanks!"
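The delegate-then-get-notified pattern can be modeled on a desktop with a worker thread standing in for the DMA engine and a channel standing in for the interrupt. This is only an analogy: `start_transfer` and `TransferComplete` are made-up names, and a real DMA engine is configured through registers, not threads.

```rust
use std::sync::mpsc;
use std::thread;

/// The event the "DMA engine" raises when it finishes.
#[derive(Debug, PartialEq, Eq)]
pub struct TransferComplete {
    pub bytes_copied: usize,
}

/// Hand a buffer to a worker thread that plays the DMA engine.
/// Returns a receiver for the completion event and a handle that
/// yields the bytes that reached the simulated wire.
pub fn start_transfer(
    src: Vec<u8>,
) -> (mpsc::Receiver<TransferComplete>, thread::JoinHandle<Vec<u8>>) {
    let (done_tx, done_rx) = mpsc::channel();
    let engine = thread::spawn(move || {
        let mut wire = Vec::with_capacity(src.len());
        for byte in &src {
            // One byte at a time, as a slow peripheral would accept them.
            wire.push(*byte);
        }
        // The "interrupt": tell the CPU side the transfer is done.
        let _ = done_tx.send(TransferComplete {
            bytes_copied: wire.len(),
        });
        wire
    });
    (done_rx, engine)
}
```

The caller starts the transfer, carries on with unrelated work, and only later calls `recv()` on the receiver to pick up the completion event.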

Amos Wenger: So why is it called direct? It doesn't seem direct to me!

James Munns: It's direct in that it allows something like a serial port to feel like it's directly pulling the data from memory. So instead of the CPU having to go and poke every byte of memory into the peripheral, it gives you sort of the appearance that this peripheral is directly pulling the bytes out of memory to put them on the wire.

Amos Wenger: Gotcha.

James Munns: So this is really awesome because we go from "busy polling" to "event driven," which means we're not just sitting there checking status bits like you were saying: we're waiting for a notification which frees our hands up to go do something else.

Which, if you've heard of async in Rust before, this is what we love in async. We don't like busy polling. We don't like blocking. We don't like syscalls and things like that. We want an event. We want to be notified when something is done. We want to be notified when a packet arrives or a packet is finished sending.

So the CPU can go, "Okay, it's time to do more stuff now." This is essentially, we're awaiting a signal at the hardware level.

So an async version of this that has a similar signature, like send where we give it a slice of bytes or something like that- we might configure the DMA, so like dma.setup(), we give it the source pointer, the destination pointer, and we say, "You're gonna be transferring this to the serial port."

So here's our destination address we write on the envelope and give to the DMA engine, and we tell it to run, and we call .await on what it gives us back. And that's gonna allow us to go off and do something else until our reactor or the interrupt or whatever comes back and says, "Hey, kick the waker. We've finished this, so now it's time for this async function to come back and do whatever." It might check the error and say, "Did that DMA transfer complete successfully or no?" And if it didn't fail, then we go, "Okay, cool. Transfer done." Which meant the CPU didn't have to do any more work than setting it up, letting it run. It gets notified, it comes back and it checks it. So it didn't busy wait for those billion cycles. It went off and did something.
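To make the "kick the waker" step concrete, here is a sketch using only the standard library. A thread again stands in for the DMA engine, `dma_start` is a hypothetical stand-in for the `dma.setup()` and run calls mentioned above, and `block_on` is a deliberately tiny executor written for this example. None of this is the code from the slide or any particular embedded framework's API.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// State shared between the future and the "DMA engine" thread.
#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
    wire: Vec<u8>,
}

/// Future that resolves once the simulated transfer has finished.
pub struct DmaTransfer {
    shared: Arc<Mutex<Shared>>,
}

impl Future for DmaTransfer {
    type Output = Vec<u8>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Vec<u8>> {
        let mut s = self.shared.lock().unwrap();
        if s.done {
            Poll::Ready(std::mem::take(&mut s.wire))
        } else {
            // Not finished yet: remember who to wake, then yield.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Hypothetical stand-in for configuring and starting a DMA channel.
pub fn dma_start(data: Vec<u8>) -> DmaTransfer {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let engine_side = Arc::clone(&shared);
    thread::spawn(move || {
        let mut s = engine_side.lock().unwrap();
        s.wire = data; // the "copy" to the peripheral
        s.done = true;
        // The "interrupt": kick the waker if a task is waiting.
        if let Some(w) = s.waker.take() {
            w.wake();
        }
    });
    DmaTransfer { shared }
}

/// Async send: returns the bytes that reached the simulated wire.
pub async fn send(data: &[u8]) -> Vec<u8> {
    dma_start(data.to_vec()).await
}

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Tiny single-future executor: sleeps until the waker fires.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}
```

Running `block_on(send(b"hello"))` returns the five bytes once the simulated engine has signalled completion. In a real executor the thread would run other tasks while this one is pending instead of parking.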

Amos Wenger: So I know you're gonna say, "It depends on whether you're running on a high end desktop computer or a tiny microcontroller," but is the DMA controller or whatever implementation an actual separate core of something? Probably not the same as your other CPU cores, or is it just a function of the CPU? Is it baked into the CPU design in some way? Is it a separate chip? I don't know.

James Munns: Yeah. So, DMA is largely, it's more of a pattern than a concrete thing. So I mean, it is 'it depends,' but it will be a chunk of silicon on your core.

Amos Wenger: I knew it!

James Munns: So whether it's on the CPU or on the motherboard or something like that, it will be wired up somewhere where it's a discrete thing. Like you might have multiple CPU cores on one CPU die or like the actual CPU unit.

It'll have some DMA cores on it as well on the silicon usually, because it has to be connected to the same memory and the same peripherals that you have on your system. Where it actually is, whether it's in the CPU box or on the motherboard somewhere- that's where we get into 'it depends,' but functionally it's going to be co located kind of like wherever your CPUs are.

One more neat thing. So I've mentioned that DMA is great for transferring data from main memory into a peripheral, but DMA's job is to copy memory from one place to another. And when it's copying to peripherals, it can be very smart because it can kind of babysit that. "Oh, peripheral, are you ready to receive some more data?" It says, "Not yet." So DMA goes, "Okay, I'll wait." And then the peripheral knows to directly tell the DMA, "Hey, I'm ready for more memory now, give me memory!" So DMA even has its own sort of event driven thing. So it's got a bit more smarts when it comes to talking to peripherals, but at the end of the day, it's copying bytes from memory to somewhere else.
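That request-driven pacing can be sketched as a small tick-based simulation. `SlowPeripheral` and `paced_transfer` are invented names, and the model assumes the peripheral raises a single "ready for data" signal. Real request lines and their timing vary from chip to chip.

```rust
/// A toy peripheral that can accept one byte every `period` ticks.
pub struct SlowPeripheral {
    period: u32,
    countdown: u32,
    /// Bytes that have gone "out on the wire".
    pub wire: Vec<u8>,
}

impl SlowPeripheral {
    pub fn new(period: u32) -> Self {
        Self { period, countdown: 0, wire: Vec::new() }
    }

    /// Advance one clock tick; returns true while the peripheral
    /// is requesting more data.
    fn tick(&mut self) -> bool {
        if self.countdown > 0 {
            self.countdown -= 1;
        }
        self.countdown == 0
    }

    fn accept(&mut self, byte: u8) {
        self.wire.push(byte);
        self.countdown = self.period;
    }
}

/// Model of the DMA side: copy a byte only on ticks where the
/// peripheral asks for one. Returns the number of ticks elapsed.
pub fn paced_transfer(src: &[u8], periph: &mut SlowPeripheral) -> u64 {
    let mut ticks = 0u64;
    let mut next = 0usize;
    while next < src.len() {
        ticks += 1;
        if periph.tick() {
            periph.accept(src[next]);
            next += 1;
        }
        // Otherwise the engine idles until the next request.
    }
    ticks
}
```

With a period of 4 ticks, three bytes take 9 ticks in this model: one for the first byte, then four for each byte after it. The CPU is not involved in any of those ticks.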

Now that could be an address that's a peripheral. But it could also be your main memory. And you might think, "Why would I want to use DMA to copy memory to memory?" Like the nice thing about a peripheral is the DMA can handle the slower speed limit. So it can be delegated to babysit that slow transfer.

But if we're already at like the peak of memory speed, memory to memory, like the CPU is loading from here- why would we want to use DMA when it's going to be the same speed as our CPU, basically. They're both going to be racing at the maximum speed of memory access.

Well, if we were to implement memcpy with like a source slice of bytes and a destination slice of bytes- this is sort of recursive, this fake code that I have on the screen, because copy_from_slice is actually implemented with memcpy, but if we had something that was kind of memcpy-like, where we wanted to copy from one chunk of memory to another one- we could do that and it would go very fast. Our CPU would load some memory and write some memory. But we're using essentially like the fanciest scientific vehicle as a pickup truck at this point. Like we're using this phenomenally capable CPU to just copy bytes from one place to another.

Which is sort of a waste as it were.

If instead we were to use DMA and we go, "Okay, well, we want to copy this gigabyte of data from this buffer to this buffer," because maybe we're copying it to another file or a- something we're going to do encoding with or something like that. We could just use DMA for that as well. So we can give it the source pointer and len and the destination pointer and len and we say, "DMA- please go off and copy this memory and let me know when you're done." So we don't get any speed benefit, or we're not really slowed down by this, but then all of a sudden we get that nice event-driven capability just for copying big chunks of memory.

And this allows us to write, essentially, async memcpy, if we really wanted to. Which is a bonkers thing to think about. But I like turning weird, abstracted-away things that you normally don't handle yourself in user space, and doing them in async because it lets you address the hardware itself. And if you were really copying a big chunk of memory, it would free up your CPU core to go off and do something.
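As a desktop analogy for a delegated memory-to-memory copy, the sketch below uses a scoped thread as the "DMA engine" while the calling thread does other work. It is not itself an async function, and `memcpy_like` and `delegated_copy` are hypothetical names. A real implementation would program a DMA channel and await its completion event.

```rust
use std::thread;

/// CPU-driven copy. Panics if the two slices differ in length,
/// because `copy_from_slice` requires equal lengths.
pub fn memcpy_like(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

/// Delegated copy: a scoped worker thread plays the DMA engine
/// while `other_work` runs on the calling thread. Returns whatever
/// `other_work` produced once the copy has also finished.
pub fn delegated_copy<R>(
    dst: &mut [u8],
    src: &[u8],
    other_work: impl FnOnce() -> R,
) -> R {
    thread::scope(|s| {
        let engine = s.spawn(|| dst.copy_from_slice(src));
        let result = other_work();
        // The join is the "transfer complete" notification.
        engine.join().expect("copy thread panicked");
        result
    })
}
```

As in the conversation, the copy itself is no faster this way. The gain is that the caller's own thread is free while the bytes move.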

I don't know how this would work in a kernel and the kernel might have something like this. I mean, the kernel probably is doing something like this under the hood. It's not in Rust async, but it probably has some event-driven memcpys where it goes, "I'm not going to sit around and copy a gigabyte of video file from here to there. I'm going to ask the hardware to do it," so the kernel can go off and do something, but-

Amos Wenger: That seems likely.

James Munns: It's fun to think of it as a user space async function instead.

Amos Wenger: It feels like most of your presentations are about, "You thought of this thing as blocking? Well, I'm going to tell you how to make it async!"

James Munns: This is one of those weird things of peering through all the layers. Because microcontrollers- firmware- is essentially the same thing as a kernel. Like a kernel is just managing the hardware for you. And a microcontroller project is really just: I'm managing all the firmware, but then also doing some business logic directly on top of it, but it's baked together.

But kind of zooming out through embedded systems to designing an operating system to user space is always fun because you go, "Where are all the lies?" Because all the layers, it's one of those things like: well, all models are lies, but some of them are useful. And in computing, we've just built up stacks and stacks of stacks of these useful models and useful lies. The hardware lies to you at multiple layers. This is what like virtual memory is and TLBs and caches and things like that is the hardware lying to you.

And then that talks to the kernel. And the kernel lies to you in a bunch of ways of abstracting away virtual memory and things like that. And then using one layer out of like what you can actually look at in user space and what it means to block or not block. I really do love just like punching holes through that stack of like- how could you express the actualities of the hardware at a syntax layer of a language if you had to pretend that all those layers of lies didn't exist anymore.

Amos Wenger: Can I just, can I close with a little anecdote I love about implementing languages?

James Munns: Mm hmm.

Amos Wenger: So you showed, you know, on a very visual medium, you showed a slide that showed memcpy calling copy_to_slice or something, and you mentioned that this would probably be recursive. But did you know that when you're making a libc and you're using a C compiler, when you feed it something that looks like memcpy, it will replace it with a memcpy intrinsic, which is a problem if you're writing the function that ends up being called by that intrinsic. So there are some cases.

James Munns: You run into this all the time in embedded systems, because a lot of memcpys provided by compilers are optimized for speed, but on your microcontroller, you might want to optimize for size, and optimizers love turning things into memcpy. Like, that's an optimizer's favorite thing to do, is go, "That's a memcpy," because you hope that memcpy is one of the most optimized primitives that you could ever have, but yeah, trying to get compilers to, one, not turn things into memcpys, and two, telling them, "Hey, here is the memcpy. I'm bringing it myself." Yeah. That happens all the time where you end up like the linker just explodes because it goes, wait a minute. What?

Episode Sponsor

Today's episode was brought to you by the Embedded Working Group Community Micro Survey.

The Rust Embedded Working Group is running a community survey to learn more about the people using Embedded Rust for hobby, university, and production usage. The survey is anonymous, should take less than five minutes, and your response helps us out a ton. The survey will be available until September 19th 2024.