WEBVTT

NOTE
This file was generated by Descript <www.descript.com>

00:00:13.212 --> 00:00:15.952
<v Amanda Majorowicz>This is Self-
Directed Research, where our hosts,

00:00:16.012 --> 00:00:19.192
James and Amos, give each other
20-minute presentations that usually

00:00:19.192 --> 00:00:22.362
last longer than that about whatever
they've been obsessing over lately.

00:00:22.782 --> 00:00:27.722
For more episodes, show notes, and
transcripts, visit sdr-podcast.com.

00:00:28.122 --> 00:00:30.022
New episodes drop every Wednesday.

00:00:30.572 --> 00:00:32.462
<v James Munns>Today's episode is
brought to you by the Embedded

00:00:32.462 --> 00:00:34.202
Working Group Community Microsurvey.

00:00:34.530 --> 00:00:38.900
Check out the description or our
show notes at sdr-podcast.com

00:00:38.920 --> 00:00:39.960
for a link to the survey.

00:00:40.580 --> 00:00:42.210
More information at
the end of the episode.

00:00:43.120 --> 00:00:46.570
This week, James presents direct
memory access for the uninitiated.

00:00:51.471 --> 00:00:51.761
Cool.

00:00:52.045 --> 00:00:55.085
Alright, so this is direct memory
access for the uninitiated.

00:00:55.705 --> 00:00:58.865
I'm gonna skip over a lot of
details, because this is a very

00:00:58.875 --> 00:01:00.565
deep rabbit hole we can go into.

00:01:00.985 --> 00:01:05.495
But I've had a couple people asking
me about DMA recently, and uh, the

00:01:05.495 --> 00:01:09.007
audience can't see it, but looking
on the face of Amos right now, I'm

00:01:09.007 --> 00:01:12.587
guessing he has questions about
DMA, so we're gonna get into this.

00:01:12.662 --> 00:01:16.068
<v Amos Wenger>I'm just, I'm
chuckling because it feels like

00:01:16.068 --> 00:01:16.978
this is one of those topics.

00:01:16.978 --> 00:01:21.268
Like, "But first we have to explain
the world," and the very first

00:01:21.268 --> 00:01:24.618
thing you said being like, "We're
going to skip over most of it."

00:01:24.828 --> 00:01:25.648
It's just amazing.

00:01:25.668 --> 00:01:28.908
Like, I'm very happy because I'm going
to get to see how you navigate this.

00:01:28.918 --> 00:01:30.298
So please, by all means, go ahead.

00:01:30.301 --> 00:01:32.793
<v James Munns>Yeah, I've tried
to pick my battles and we'll

00:01:32.793 --> 00:01:34.183
see what questions you ask.

00:01:34.368 --> 00:01:35.358
<v Amos Wenger>Let's remember those words.

00:01:35.378 --> 00:01:35.678
Yes.

00:01:36.323 --> 00:01:39.915
<v James Munns>So you've probably
heard the term DMA and DMA

00:01:39.925 --> 00:01:41.955
stands for direct memory access.

00:01:41.995 --> 00:01:45.925
This is used pretty much across
all of computing, all the way down

00:01:45.925 --> 00:01:48.515
to the little microcontrollers to
your desktop and things like that.

00:01:48.815 --> 00:01:53.345
DMA broadly is a thing that's
applicable to all of this.

00:01:53.845 --> 00:01:55.325
But before I explain what it is,

00:01:55.845 --> 00:01:58.585
I should probably explain memory
access for someone who's only

00:01:58.585 --> 00:02:01.165
ever programmed something on a
desktop or something like that.

00:02:01.475 --> 00:02:04.835
You kind of go, well, my CPU,
I have my RAM, I have my hard

00:02:04.835 --> 00:02:05.825
drive and stuff like that.

00:02:06.355 --> 00:02:10.744
It's easy to not realize
that your computer is a whole

00:02:10.754 --> 00:02:12.364
network of computers basically.

00:02:12.544 --> 00:02:18.014
And even something as simple as accessing
memory is almost more like talking

00:02:18.014 --> 00:02:20.399
to a server over the internet than...

00:02:20.850 --> 00:02:22.710
I can't even think of a
simpler metaphor than that.

00:02:23.197 --> 00:02:23.967
<v Amos Wenger>Yeah, I
was going to say,

00:02:24.147 --> 00:02:25.207
"What's your metaphor here?"

00:02:25.270 --> 00:02:26.859
<v James Munns>Yeah, I have-
I'm left with nothing.

00:02:26.907 --> 00:02:28.597
<v Amos Wenger>The simplest of
things is actually more complicated

00:02:28.597 --> 00:02:29.847
than you think, so good luck.

00:02:30.047 --> 00:02:30.497
<v James Munns>Yeah.

00:02:30.547 --> 00:02:35.940
So let's say we have a system where
we've got two CPU cores and some memory.

00:02:36.290 --> 00:02:38.760
And I'm not going to talk about
caches or anything like that.

00:02:38.910 --> 00:02:41.953
We're going to pretend those don't
exist, cause that's another rabbit hole

00:02:41.953 --> 00:02:46.213
to go down, but we've got two CPUs and
they're connected to the memory through

00:02:46.213 --> 00:02:48.253
something that's called a memory bus.

00:02:48.573 --> 00:02:51.813
So this is the actual network
connection more or less between

00:02:51.813 --> 00:02:53.073
your CPU and your memory.

00:02:53.653 --> 00:02:57.383
And this is useful because you might
have multiple chunks of memory.

00:02:57.383 --> 00:03:00.463
So you might have multiple sticks
and each stick of memory has

00:03:00.473 --> 00:03:02.233
modules on it and things like that.

00:03:02.573 --> 00:03:06.023
And if you have two CPUs and they
both want to access the same memory

00:03:06.023 --> 00:03:08.168
at the same time, they can't.

00:03:08.251 --> 00:03:11.491
They need to arbitrate that
access and figure out: Hey,

00:03:11.491 --> 00:03:12.761
I want to access this memory.

00:03:12.781 --> 00:03:14.101
Where is that memory?

00:03:14.251 --> 00:03:14.601
Okay.

00:03:14.601 --> 00:03:15.921
It's on this chip over here.

00:03:15.951 --> 00:03:16.901
I need to go get that.

00:03:17.111 --> 00:03:18.471
And I need to read or write from it.

00:03:18.491 --> 00:03:22.391
And if two of them want to do that at the
same time, something needs to control:

00:03:22.391 --> 00:03:24.701
okay, you go first and then you go second.

00:03:25.141 --> 00:03:30.241
So this memory bus is about
connecting to the memory, but also

00:03:30.401 --> 00:03:33.161
arbitrating the access to this memory.

00:03:33.981 --> 00:03:36.341
Now we're still working in the
model where like, we're talking

00:03:36.341 --> 00:03:38.261
about a specific pointer address.

00:03:38.271 --> 00:03:40.431
Again, I'm not going to get into
virtual memory, which is a whole

00:03:40.491 --> 00:03:45.011
other layer, but let's say we have the
actual physical memory address: address

00:03:45.391 --> 00:03:51.651
40000 whatever, we can say that maps
to this specific part of this chip.

00:03:51.941 --> 00:03:54.341
And if both of them want to
touch it at the same time, they

00:03:54.341 --> 00:03:55.831
have to negotiate for that.

00:03:56.181 --> 00:03:56.781
Now,

00:03:57.034 --> 00:03:59.634
<v James Munns>memory isn't the
only thing that your CPU talks to.

00:04:00.144 --> 00:04:03.374
Microcontrollers have a ton of little
peripherals for talking to serial

00:04:03.374 --> 00:04:08.604
ports, or timers, or a bunch of
useful accessories that you might

00:04:08.604 --> 00:04:12.754
want, and your desktop CPU is going
to have something kind of like this.

00:04:13.070 --> 00:04:16.620
It could be for things like serial ports,
it might be for talking to sensors like

00:04:16.640 --> 00:04:20.620
temperature sensors on your motherboard,
it could even be for talking to things

00:04:20.620 --> 00:04:23.030
that are relatively slow like USB.

00:04:23.230 --> 00:04:26.825
So compared to your CPU and your
main memory, USB- especially like

00:04:26.835 --> 00:04:29.855
older USB 2- is way slower than this.

00:04:30.215 --> 00:04:34.525
So you might have a separate bus called
the peripheral bus, where all those

00:04:34.525 --> 00:04:36.315
things like serial ports are connected.

00:04:37.105 --> 00:04:39.395
Now we usually map these also in memory.

00:04:39.565 --> 00:04:42.655
So there might be a specific
pointer, physical address

00:04:42.685 --> 00:04:43.995
that this pointer goes to.

00:04:44.365 --> 00:04:50.451
And when your system talks to memory
that is real memory, it goes: Oh,

00:04:50.451 --> 00:04:53.861
okay, this range of addresses lives
on the memory bus, and this range of

00:04:53.871 --> 00:04:55.711
addresses lives on the peripheral bus.

00:04:55.721 --> 00:04:59.801
So when I'm talking to this memory,
I'm talking to my DDR memory on the

00:04:59.801 --> 00:05:03.111
motherboard, and when I'm talking to
this memory, I'm talking to a USB port

00:05:03.141 --> 00:05:04.931
or a serial port or something like that.
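
NOTE
A rough sketch of the address decoding described above, in Rust.
The address ranges and the names `Bus` and `decode` are made up for
illustration; every real chip has its own memory map.
```rust
/// Which bus a physical address is routed to.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bus {
    Memory,
    Peripheral,
    Unmapped,
}
// Hypothetical address map (assumption, not taken from any datasheet).
const RAM_START: u32 = 0x2000_0000;
const RAM_END: u32 = 0x2004_0000; // exclusive: 256 KiB of RAM
const PERIPH_START: u32 = 0x4000_0000;
const PERIPH_END: u32 = 0x4010_0000; // exclusive
/// Decide where a load or store to `addr` would be sent.
pub fn decode(addr: u32) -> Bus {
    if (RAM_START..RAM_END).contains(&addr) {
        Bus::Memory
    } else if (PERIPH_START..PERIPH_END).contains(&addr) {
        Bus::Peripheral
    } else {
        Bus::Unmapped
    }
}
```
With this map, `decode(0x4000_4000)` lands on the peripheral bus, while
`decode(0x2000_0010)` is ordinary RAM.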

00:05:05.281 --> 00:05:07.211
<v Amos Wenger>So this is all
happening at the hardware level?

00:05:07.221 --> 00:05:10.021
Like it's not a feature of the
kernel, like memory-mapped files?

00:05:10.021 --> 00:05:13.561
It's actually like the hardware
knows this range is for this device.

00:05:14.336 --> 00:05:16.756
<v James Munns>Yeah, this is one of
those where there's a lot of layers

00:05:16.756 --> 00:05:20.716
of abstraction and a lot of systems
handle this very differently.

00:05:20.996 --> 00:05:23.206
For example, on some systems,
the peripheral bus might be

00:05:23.206 --> 00:05:24.806
exposed through the memory bus.

00:05:24.836 --> 00:05:26.526
Like, it might not be one or the other.

00:05:26.526 --> 00:05:28.216
You might have to go
through one to another.

00:05:28.486 --> 00:05:33.421
But generally: Yes, at a hardware
level, this is exposed as a memory

00:05:33.431 --> 00:05:36.471
address, and then it's exposed
to higher levels of the software.

00:05:36.471 --> 00:05:39.861
So what your kernel would actually
interact with would be some memory

00:05:39.861 --> 00:05:41.721
address where the serial port lives.

00:05:41.721 --> 00:05:42.111
<v Amos Wenger>Gotcha.

00:05:42.401 --> 00:05:44.701
<v James Munns>Then, once we go to user
space and stuff, it's abstracted and

00:05:44.701 --> 00:05:48.571
abstracted and abstracted, but at
the actual hardware level, there's

00:05:48.581 --> 00:05:51.991
generally a- like a memory address that
you can go to to talk to something.

00:05:52.501 --> 00:05:55.831
Now, the problem is- well,
your CPU is incredibly fast.

00:05:56.341 --> 00:05:59.542
Your memory is reasonably fast.

00:05:59.542 --> 00:06:04.112
It's closer to the CPU speed than
something else, but peripherals like

00:06:04.112 --> 00:06:09.092
a serial port or something like that,
or even USB 2.0 are so many, maybe a

00:06:09.092 --> 00:06:13.652
million times slower than what your
CPU can do or your main memory can do.

00:06:14.012 --> 00:06:17.112
Which means when we're talking to
them, this can be painful because

00:06:17.382 --> 00:06:21.772
just setting a value in memory and
reading a value from memory might be very

00:06:21.772 --> 00:06:25.192
quick, but if you want to write some
data over a serial port, that takes

00:06:25.512 --> 00:06:27.992
millennia, perceptually, to the CPU.

00:06:27.992 --> 00:06:29.202
It feels like forever.

00:06:30.362 --> 00:06:35.432
Now, a modern desktop, so we're in
2024, a modern desktop, like a MacBook

00:06:35.462 --> 00:06:40.079
Pro, might have a memory bandwidth- the
speed limit between your CPU and the

00:06:40.079 --> 00:06:42.469
RAM- might be in hundreds of gigabytes.

00:06:42.479 --> 00:06:48.075
So a high end MacBook Pro will have
maybe 400 gigabytes per second bandwidth

00:06:48.105 --> 00:06:51.425
between the CPU cores and the main memory.

00:06:52.745 --> 00:06:56.905
Today's modern microcontrollers
like a Raspberry Pi RP2040 or

00:06:56.905 --> 00:07:00.645
something like that might have
hundreds of megabytes of bandwidth.

00:07:00.655 --> 00:07:03.485
So talking to its memory, which
again, works totally different,

00:07:03.495 --> 00:07:06.905
but it might be able to read or
write hundreds of megabytes per

00:07:06.905 --> 00:07:09.205
second at a reasonable clock speed.

00:07:10.395 --> 00:07:13.945
Now if we have a serial port, like
something that you might use to hook up

00:07:13.945 --> 00:07:17.315
to a very old piece of control equipment
or something like that, is measured

00:07:17.315 --> 00:07:23.635
in baud, or symbols per second, and at
115200 baud, which is a very common serial

00:07:23.635 --> 00:07:27.275
port speed that you might have on your
computer, if your computer still has a

00:07:27.275 --> 00:07:31.075
serial port, is 11.25 kilobytes a second.
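
NOTE
The 11.25 figure can be reproduced with a little arithmetic. This sketch
assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on
the wire per byte) and 1 kilobyte = 1024 bytes; neither assumption is
stated in the episode. The function names are illustrative only.
```rust
/// Bits on the wire per payload byte, assuming 8N1 framing.
const BITS_PER_FRAME: u32 = 10;
/// Payload bytes per second for a given baud rate.
pub fn bytes_per_second(baud: u32) -> u32 {
    baud / BITS_PER_FRAME
}
/// The same figure in KiB per second (1 KiB = 1024 bytes).
pub fn kib_per_second(baud: u32) -> f64 {
    f64::from(bytes_per_second(baud)) / 1024.0
}
```
Under these assumptions, 115200 baud is 11520 bytes per second, which is
11.25 KiB per second.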

00:07:31.385 --> 00:07:35.625
So we're like five orders of magnitude
slower than our microcontroller,

00:07:35.655 --> 00:07:40.239
and like, ten orders of magnitude
slower than our MacBook Pro.

00:07:40.472 --> 00:07:44.862
So this is like a huge mismatch of
like if I wanted to send one byte

00:07:44.872 --> 00:07:47.882
over the serial port this takes as
long as it would take me to stream

00:07:47.882 --> 00:07:50.952
like a whole movie to my main CPU.

00:07:51.285 --> 00:07:53.545
<v Amos Wenger>I know I look young,
but I am old enough that I have

00:07:53.555 --> 00:07:57.235
owned peripherals that were connected
over parallel port and serial port.

00:07:57.245 --> 00:07:59.995
And I'm now wondering if like the
scanners of the time, for example,

00:08:00.145 --> 00:08:04.655
were maybe limited by the speed of
the connection rather than the actual,

00:08:04.675 --> 00:08:05.855
I don't know, the speed of scanning.

00:08:06.362 --> 00:08:07.272
<v James Munns>Yeah, yeah for sure.

00:08:07.542 --> 00:08:10.462
And a lot of serial devices
didn't even run this fast.

00:08:10.682 --> 00:08:14.892
Like 115200 is a reasonably
quick serial port speed.

00:08:14.892 --> 00:08:21.482
Some of them are down at 9600 baud, which
is like a kilobyte a second basically.

00:08:21.502 --> 00:08:23.402
So it can get slower than this for sure.

00:08:23.677 --> 00:08:27.037
<v Amos Wenger>And for a while,
internet access speed was directly

00:08:27.037 --> 00:08:28.527
related to that, I believe?

00:08:28.527 --> 00:08:30.837
Like the first modems, like 56k, whatever.

00:08:31.167 --> 00:08:32.447
I haven't had anything slower.

00:08:32.610 --> 00:08:33.440
<v James Munns>Some of them were.

00:08:33.457 --> 00:08:37.442
We're in this order of magnitude, like
old dial up- if you talk about 56k

00:08:37.962 --> 00:08:42.512
modems, that would be five times faster
than this serial port I'm describing.

00:08:42.572 --> 00:08:46.282
But also like first gen internet
might've been at this speed, but

00:08:46.332 --> 00:08:50.172
we're at the order of magnitude
of dial up, if that makes sense.

00:08:50.283 --> 00:08:50.463
<v Amos Wenger>Yeah.

00:08:51.081 --> 00:08:54.081
<v James Munns>So we have this sort of
setup where we've got the connection

00:08:54.081 --> 00:08:57.318
between our memory, like where we have
the data we want to send over the serial

00:08:57.318 --> 00:09:01.018
port, and the peripheral, that has a
speed limit of gigabytes per second,

00:09:01.018 --> 00:09:02.528
or hundreds of gigabytes per second.

00:09:02.808 --> 00:09:05.918
And then the speed limit between our
serial port peripheral, like the little

00:09:05.928 --> 00:09:10.288
hardware accelerator on our chip that
handles serial port stuff, sending that

00:09:10.288 --> 00:09:13.808
actually over the physical wire, like
the electrical signals over the wire,

00:09:14.018 --> 00:09:16.078
that has a speed limit of 11 kilobytes.

00:09:16.348 --> 00:09:19.695
So we have to figure out how do we
not make our whole system slow to

00:09:19.695 --> 00:09:24.025
a stop every single time we want to
send some data over the serial port.

00:09:24.635 --> 00:09:28.295
And also make sure that that data we're
sending over the serial port is smooth.

00:09:28.765 --> 00:09:32.785
So if we send one byte and then go
off and do something else, we want

00:09:32.785 --> 00:09:36.245
to make sure that next byte to go out
on the wire is ready so there's no

00:09:36.325 --> 00:09:39.615
pause in between each byte because
that'll slow us down even more.

00:09:39.615 --> 00:09:43.325
So we want that to be like
smooth data transfer as well.

00:09:44.045 --> 00:09:47.485
So if we were to write, like- this is all
very fake, but not far off from what you

00:09:47.485 --> 00:09:50.775
would write in a microcontroller, if we
were to write a blocking send function-

00:09:51.045 --> 00:09:55.635
for each byte that we send, we might
do a for loop where we say: Okay, check

00:09:55.635 --> 00:09:57.925
if there's an error with the hardware,
because there might be an error with

00:09:57.925 --> 00:10:01.625
the hardware, and then we wait until
the hardware says, "I'm ready to take

00:10:01.625 --> 00:10:03.195
some more data to put out on the wire."

00:10:03.495 --> 00:10:05.975
Which means we're just sitting
there in a while loop, waiting,

00:10:05.985 --> 00:10:07.845
just burning CPU cycles.

00:10:08.095 --> 00:10:11.255
And then finally when it says, "I'm
ready for one byte," we give it one

00:10:11.255 --> 00:10:15.495
byte, and then we go back to waiting,
and we just sit there and wait forever.

00:10:15.715 --> 00:10:19.875
And like I said, it might be millions
or billions of CPU cycles for each

00:10:19.885 --> 00:10:21.895
byte that goes out over the wire.
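
NOTE
The slide code is not part of this transcript. What follows is a rough
reconstruction of the blocking send loop as described, not the original.
`FakeUart`, `UartError`, and `send_blocking` are hypothetical names, and
the fake hardware simply becomes ready after a fixed number of polls.
```rust
/// Errors the pretend hardware can report through a status bit.
#[derive(Debug, PartialEq, Eq)]
pub enum UartError {
    Overrun,
}
/// Stand-in for a memory-mapped serial peripheral.
pub struct FakeUart {
    polls_per_byte: u32,
    polls_since_last_byte: u32,
    pub error_flag: bool,
    pub sent: Vec<u8>,
    pub wasted_polls: u64,
}
impl FakeUart {
    pub fn new(polls_per_byte: u32) -> Self {
        Self {
            polls_per_byte,
            polls_since_last_byte: 0,
            error_flag: false,
            sent: Vec::new(),
            wasted_polls: 0,
        }
    }
    /// Turn the raw error bit into a `Result`.
    fn check_error(&self) -> Result<(), UartError> {
        if self.error_flag { Err(UartError::Overrun) } else { Ok(()) }
    }
    /// Poll the "ready" status bit; true once enough polls have elapsed.
    fn is_ready(&mut self) -> bool {
        self.polls_since_last_byte += 1;
        self.polls_since_last_byte >= self.polls_per_byte
    }
    /// Hand one byte to the hardware and restart the wait.
    fn write_byte(&mut self, byte: u8) {
        self.sent.push(byte);
        self.polls_since_last_byte = 0;
    }
}
/// Blocking send: check for errors, spin until ready, hand over one byte.
pub fn send_blocking(uart: &mut FakeUart, data: &[u8]) -> Result<(), UartError> {
    for &byte in data {
        uart.check_error()?;
        while !uart.is_ready() {
            uart.wasted_polls += 1; // the CPU does nothing useful here
        }
        uart.write_byte(byte);
    }
    Ok(())
}
```
The `wasted_polls` counter exists only to make the cost visible: every
increment is a loop iteration in which the CPU did no useful work.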

00:10:22.585 --> 00:10:25.055
<v Amos Wenger>This is funny because I'm
looking at your pseudocode and it's Rust.

00:10:25.225 --> 00:10:26.785
And in Rust we have sum types.

00:10:26.785 --> 00:10:29.245
So we have results with like
an OK and error variant.

00:10:29.595 --> 00:10:33.915
But in hardware, errors are like:
this bit somewhere in a register or

00:10:33.915 --> 00:10:37.005
whatever flips when something is wrong
and you have to check it all the time

00:10:37.035 --> 00:10:40.195
and this is, you're just busy looping
instead of doing anything asynchronous.

00:10:40.235 --> 00:10:40.625
But yeah.

00:10:40.945 --> 00:10:42.015
This is fun to see in Rust.

00:10:42.240 --> 00:10:46.378
<v James Munns>Yeah, and we do that a
lot in embedded Rust is we turn, you

00:10:46.378 --> 00:10:48.198
know, you might have six different bits.

00:10:48.271 --> 00:10:51.581
Imagine like error LEDs on your
Wi-Fi radio or something like that.

00:10:51.921 --> 00:10:55.201
We turn those into an enum value,
but yeah, exact same kind of thing.

00:10:55.201 --> 00:10:57.991
But we have to just literally
check: is there an error?

00:10:58.735 --> 00:11:02.999
Now, this is awful because like I said,
we're burning millions or billions of CPU

00:11:02.999 --> 00:11:07.189
cycles potentially just waiting, doing
nothing, which is incredibly wasteful.

00:11:07.743 --> 00:11:09.113
So this is where DMA comes in.

00:11:10.063 --> 00:11:12.773
DMA is for babysitting memory copies.

00:11:14.223 --> 00:11:18.883
It lets us delegate that responsibility
of: here are 600 bytes I would

00:11:18.883 --> 00:11:20.643
like to send over a serial port.

00:11:21.303 --> 00:11:24.663
And your CPU just goes:
here are the 600 bytes.

00:11:24.953 --> 00:11:29.423
DMA, please send this to the serial
port and let me know when you're done.

00:11:31.073 --> 00:11:34.663
And the way that these actually look
from a hardware level, and again this

00:11:34.663 --> 00:11:38.403
is one of those things where if you look
at 10 chips they might implement DMA 10

00:11:38.403 --> 00:11:43.343
different ways, but conceptually you can
think of them like a very, very, very

00:11:43.343 --> 00:11:46.433
simple CPU core who can only do one thing.

00:11:46.818 --> 00:11:49.128
And that's copy memory
from one place to another.

00:11:49.138 --> 00:11:52.708
So it's not like your desktop CPU,
where, you know, you have a whole

00:11:52.708 --> 00:11:57.168
instruction set, X86 or ARM or whatever
that can do tons of different things.

00:11:57.188 --> 00:12:01.408
DMA is like a CPU core who's
babysitting memory copies.

00:12:01.948 --> 00:12:05.848
So, the way this like delegation
works is your CPU will have some data

00:12:05.848 --> 00:12:10.118
in memory- so like that 600 bytes
of serial data you want to send.

00:12:10.678 --> 00:12:13.578
And you'll have a pointer and
a length- basically a slice.

00:12:13.588 --> 00:12:17.681
So your hardware conceptually will
say, "Starting address is here, and

00:12:17.681 --> 00:12:21.111
then the next 600 bytes are what I
want to send over the serial port."

00:12:21.431 --> 00:12:25.368
So you hand those two pieces of
information to DMA, maybe configure

00:12:25.368 --> 00:12:27.298
it somehow, and then you say, "Go."

00:12:27.408 --> 00:12:29.338
And it goes, "Okay, I'll
let you know when I'm done."

00:12:29.758 --> 00:12:30.928
And it goes off and does it.

00:12:31.278 --> 00:12:35.758
And then at some point later, you get
a notification or an event or something

00:12:36.078 --> 00:12:38.048
that says, "DMA transfer complete!"

00:12:38.428 --> 00:12:41.238
So you get sort of a, on a
microcontroller, it might be an interrupt.

00:12:41.258 --> 00:12:45.305
On a more desktop piece of hardware,
it might be some kind of event or

00:12:45.305 --> 00:12:48.629
something like that, but you'll get
a notification or an event sometime

00:12:48.789 --> 00:12:50.399
later that says, "This happened."

00:12:50.409 --> 00:12:53.799
So you went off and were doing
something else for a billion CPU cycles.

00:12:54.159 --> 00:12:56.114
And it said, "I'm done, thanks!"
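
NOTE
A minimal sketch of that delegation, assuming a plain thread can stand in
for the DMA engine and a channel message for the completion interrupt.
`dma_start`, `send_with_dma`, and `DmaEvent` are hypothetical names, not
a real hardware API.
```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
/// Event the pretend DMA engine posts when it finishes.
#[derive(Debug, PartialEq, Eq)]
pub enum DmaEvent {
    TransferComplete { bytes: usize },
}
/// Hand a buffer to the pretend DMA engine and return immediately.
/// The thread drains the buffer onto `wire`, then posts an event.
pub fn dma_start(data: Vec<u8>) -> Receiver<(DmaEvent, Vec<u8>)> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut wire = Vec::with_capacity(data.len());
        for byte in &data {
            wire.push(*byte); // stand-in for feeding the serial peripheral
        }
        // Stand-in for the "transfer complete" interrupt.
        let _ = tx.send((DmaEvent::TransferComplete { bytes: wire.len() }, wire));
    });
    rx
}
/// Start a transfer, do unrelated work, then pick up the completion event.
pub fn send_with_dma(data: Vec<u8>) -> (u64, DmaEvent, Vec<u8>) {
    let done = dma_start(data);
    // The CPU is free here instead of polling status bits.
    let other_work: u64 = (1..=1_000u64).sum();
    // Wait only once there is nothing else left to do.
    let (event, wire) = done.recv().expect("DMA thread dropped the channel");
    (other_work, event, wire)
}
```
The caller passes the data once, keeps working, and only learns about
the transfer again when the completion event arrives.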

00:12:56.421 --> 00:12:57.831
<v Amos Wenger>So why is it called direct?

00:12:58.551 --> 00:12:59.901
It doesn't seem direct to me!

00:13:00.896 --> 00:13:05.106
<v James Munns>It's direct in that
it allows something like a serial

00:13:05.106 --> 00:13:08.596
port to feel like it's directly
pulling the data from memory.

00:13:08.786 --> 00:13:12.506
So instead of the CPU having to go
and poke every byte of memory into

00:13:12.506 --> 00:13:16.189
the peripheral, it gives you sort of
the appearance that this peripheral

00:13:16.189 --> 00:13:20.019
is directly pulling the bytes out
of memory to put them on the wire.

00:13:20.320 --> 00:13:20.710
<v Amos Wenger>Gotcha.

00:13:21.406 --> 00:13:25.076
<v James Munns>So this is really awesome
because we go from "busy polling" to

00:13:25.106 --> 00:13:28.556
"event driven," which means we're not
just sitting there checking status bits

00:13:28.556 --> 00:13:32.326
like you were saying: we're waiting
for a notification which frees our

00:13:32.326 --> 00:13:33.986
hands up to go do something else.

00:13:34.796 --> 00:13:38.906
Which, if you've heard of async in Rust
before, this is what we love in async.

00:13:38.936 --> 00:13:40.256
We don't like busy polling.

00:13:40.256 --> 00:13:41.216
We don't like blocking.

00:13:41.216 --> 00:13:42.896
We don't like syscalls
and things like that.

00:13:43.146 --> 00:13:44.196
We want an event.

00:13:44.346 --> 00:13:46.696
We want to be notified
when something is done.

00:13:46.926 --> 00:13:50.486
We want to be notified when a packet
arrives or a packet is finished sending.

00:13:50.696 --> 00:13:53.476
So the CPU can go, "Okay, it's
time to do more stuff now."

00:13:53.516 --> 00:13:57.496
This is essentially, we're awaiting
a signal at the hardware level.

00:13:58.126 --> 00:14:02.216
So an async version of this that has a
similar signature, like send where we

00:14:02.216 --> 00:14:05.866
give it a slice of bytes or something
like that- we might configure the

00:14:05.866 --> 00:14:10.206
DMA, so like `dma.setup()`, we give
it the source pointer, the destination

00:14:10.206 --> 00:14:13.656
pointer, and we say, "You're gonna be
transferring this to the serial port."

00:14:13.856 --> 00:14:17.476
So here's our destination address we
write on the envelope and give to the

00:14:17.486 --> 00:14:22.466
DMA engine, and we tell it to run, and we
call `.await` on what it gives us back.

00:14:22.496 --> 00:14:26.946
And that's gonna allow us to go off
and do something else until our reactor

00:14:26.966 --> 00:14:30.756
or the interrupt or whatever comes
back and says, "Hey, kick the waker.

00:14:30.883 --> 00:14:34.061
We've finished this, so now it's
time for this async function

00:14:34.061 --> 00:14:35.461
to come back and do whatever."

00:14:35.911 --> 00:14:39.451
It might check the error and
say, "Did that DMA transfer

00:14:39.691 --> 00:14:41.301
complete successfully or no?"

00:14:41.781 --> 00:14:44.908
And if it didn't fail,
then we go, "Okay, cool.

00:14:45.038 --> 00:14:45.618
Transfer done."

00:14:45.848 --> 00:14:48.198
Which meant the CPU didn't
have to do any more work than

00:14:48.378 --> 00:14:50.268
setting it up, letting it run.

00:14:50.268 --> 00:14:52.028
It gets notified, it comes
back and it checks it.

00:14:52.038 --> 00:14:54.618
So it didn't busy wait
for those billion cycles.

00:14:54.878 --> 00:14:56.068
It went off and did something.

00:14:56.888 --> 00:14:59.478
<v Amos Wenger>So I know you're gonna say,
"It depends on whether you're running

00:14:59.478 --> 00:15:04.348
on a high end desktop computer or a
tiny microcontroller," but is the DMA

00:15:04.598 --> 00:15:08.708
controller or whatever implementation
an actual separate core or something?

00:15:08.708 --> 00:15:11.715
Probably not the same as your
other CPU cores, or is it

00:15:11.715 --> 00:15:13.195
just a function of the CPU?

00:15:13.195 --> 00:15:15.325
Is it baked into the
CPU design in some way?

00:15:15.435 --> 00:15:16.205
Is it a separate chip?

00:15:16.225 --> 00:15:16.515
I don't know.

00:15:16.968 --> 00:15:17.328
<v James Munns>Yeah.

00:15:17.358 --> 00:15:21.538
So, DMA is largely, it's more of
a pattern than a concrete thing.

00:15:21.548 --> 00:15:26.534
So I mean, it is 'it depends,' but it
will be a chunk of silicon on your core.

00:15:26.758 --> 00:15:27.213
<v Amos Wenger>I knew it!

00:15:27.643 --> 00:15:30.853
<v James Munns>So whether it's on the
CPU or on the motherboard or something

00:15:30.853 --> 00:15:35.703
like that, it will be wired up
somewhere where it's a discrete thing.

00:15:35.703 --> 00:15:42.263
Like you might have multiple CPU cores on
one CPU die or like the actual CPU unit.

00:15:42.483 --> 00:15:46.883
It'll have some DMA cores on it as well on
the silicon usually, because it has to be

00:15:47.133 --> 00:15:51.143
connected to the same memory and the same
peripherals that you have on your system.

00:15:51.193 --> 00:15:55.593
Where it actually is, whether it's in the
CPU box or on the motherboard somewhere-

00:15:56.003 --> 00:15:59.602
that's where we get into 'it depends,' but
functionally it's going to be co-located

00:15:59.602 --> 00:16:01.682
kind of like wherever your CPUs are.

00:16:02.412 --> 00:16:03.452
One more neat thing.

00:16:03.452 --> 00:16:08.160
So I've mentioned that DMA is great
for transferring data from main memory

00:16:08.170 --> 00:16:12.768
into a peripheral, but DMA's job is to
copy memory from one place to another.

00:16:13.088 --> 00:16:16.338
And when it's copying to peripherals,
it can be very smart because

00:16:16.338 --> 00:16:17.828
it can kind of babysit that.

00:16:18.393 --> 00:16:21.013
"Oh, peripheral, are you ready
to receive some more data?"

00:16:21.013 --> 00:16:22.013
It says, "Not yet."

00:16:22.193 --> 00:16:23.633
So DMA goes, "Okay, I'll wait."

00:16:23.653 --> 00:16:26.633
And then the peripheral knows to
directly tell the DMA, "Hey, I'm ready

00:16:26.633 --> 00:16:28.103
for more memory now, give me memory!"

00:16:28.243 --> 00:16:30.912
So DMA even has its own
sort of event driven thing.

00:16:30.922 --> 00:16:34.722
So it's got a bit more smarts when it
comes to talking to peripherals, but

00:16:34.722 --> 00:16:39.139
at the end of the day, it's copying
bytes from memory to somewhere else.

00:16:39.209 --> 00:16:41.899
Now that could be an
address that's a peripheral.

00:16:42.314 --> 00:16:44.614
But it could also be your main memory.

00:16:44.754 --> 00:16:48.438
And you might think, "Why would I want
to use DMA to copy memory to memory?"

00:16:48.498 --> 00:16:53.561
Like the nice thing about a peripheral is
the DMA can handle the slower speed limit.

00:16:53.561 --> 00:16:57.391
So it can be delegated to
babysit that slow transfer.

00:16:57.391 --> 00:17:01.321
But if we're already at like the peak
of memory speed, memory to memory, like

00:17:01.321 --> 00:17:05.861
the CPU is loading from here- why would
we want to use DMA when it's going to

00:17:05.871 --> 00:17:08.511
be the same speed as our CPU, basically.

00:17:08.511 --> 00:17:11.861
They're both going to be racing at
the maximum speed of memory access.

00:17:12.679 --> 00:17:18.029
Well, if we were to implement memcpy
with like a source slice of bytes and a

00:17:18.029 --> 00:17:22.309
destination slice of bytes- this is sort
of recursive, this fake code that I have

00:17:22.309 --> 00:17:25.349
on the screen, because copy_from_slice
is actually implemented with memcpy,

00:17:25.759 --> 00:17:28.549
but if we had something that was kind of
memcpy-like, where we wanted to copy from

00:17:28.549 --> 00:17:32.749
one chunk of memory to another one- we
could do that and it would go very fast.

00:17:32.759 --> 00:17:35.569
Our CPU would load some
memory and write some memory.

00:17:35.832 --> 00:17:39.982
But we're using essentially like
the fanciest scientific vehicle

00:17:40.382 --> 00:17:42.392
as a pickup truck at this point.

00:17:42.422 --> 00:17:47.302
Like we're using this phenomenally
capable CPU to just copy bytes

00:17:47.312 --> 00:17:48.742
from one place to another.

00:17:48.932 --> 00:17:51.332
Which is sort of a waste as it were.

00:17:52.062 --> 00:17:57.292
If instead we were to use DMA and we go,
"Okay, well, we want to copy this gigabyte

00:17:57.292 --> 00:18:01.762
of data from this buffer to this buffer,"
because maybe we're copying it to another

00:18:01.762 --> 00:18:05.020
file or a- something we're going to do
encoding with or something like that.

00:18:05.600 --> 00:18:08.511
We could just use DMA for that as well.

00:18:08.751 --> 00:18:11.951
So we can give it the source pointer
and len and the destination pointer

00:18:11.951 --> 00:18:16.281
and len and we say, "DMA- please
go off and copy this memory and

00:18:16.281 --> 00:18:17.121
let me know when you're done."

00:18:17.441 --> 00:18:21.641
So we don't get any speed benefit, or
we're not really slowed down by this,

00:18:21.751 --> 00:18:26.291
but then all of a sudden we get that
nice event-driven capability just

00:18:26.301 --> 00:18:28.151
for copying big chunks of memory.

00:18:28.391 --> 00:18:33.171
And this allows us to write, essentially,
async memcpy, if we really wanted to.
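
NOTE
A sketch of what an async memcpy could look like, using only the Rust
standard library. A thread stands in for the DMA engine, and `block_on`
is a tiny hand-written executor; `dma_memcpy`, `DmaCopy`, `copy_async`,
and `block_on` are all hypothetical names, not the code from the slides.
```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Condvar, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
/// State shared between the future and the pretend DMA engine.
#[derive(Default)]
struct Shared {
    result: Option<Vec<u8>>,
    waker: Option<Waker>,
}
/// Future that resolves once the copy has finished.
pub struct DmaCopy {
    shared: Arc<Mutex<Shared>>,
}
/// Start a memory-to-memory copy on the pretend DMA engine.
pub fn dma_memcpy(src: Vec<u8>) -> DmaCopy {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let engine_side = Arc::clone(&shared);
    thread::spawn(move || {
        let mut dst = vec![0u8; src.len()];
        dst.copy_from_slice(&src); // the actual copy
        let mut state = engine_side.lock().unwrap();
        state.result = Some(dst);
        // "Kick the waker": tell the executor the transfer is complete.
        if let Some(waker) = state.waker.take() {
            waker.wake();
        }
    });
    DmaCopy { shared }
}
impl Future for DmaCopy {
    type Output = Vec<u8>;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Vec<u8>> {
        let mut state = self.shared.lock().unwrap();
        match state.result.take() {
            Some(copy) => Poll::Ready(copy),
            None => {
                // Not done yet: remember who to notify, then yield.
                state.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}
/// Wake-up flag for the minimal executor below.
struct Signal {
    woken: Mutex<bool>,
    cond: Condvar,
}
impl Wake for Signal {
    fn wake(self: Arc<Self>) {
        *self.woken.lock().unwrap() = true;
        self.cond.notify_one();
    }
}
/// Minimal executor: poll once, then sleep until the waker fires.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let signal = Arc::new(Signal { woken: Mutex::new(false), cond: Condvar::new() });
    let waker = Waker::from(Arc::clone(&signal));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        let mut woken = signal.woken.lock().unwrap();
        while !*woken {
            woken = signal.cond.wait(woken).unwrap();
        }
        *woken = false;
    }
}
/// Async function in the style described: set up, run, await.
pub async fn copy_async(src: Vec<u8>) -> Vec<u8> {
    dma_memcpy(src).await
}
```
A real executor would run other tasks while this future is pending; the
single-future `block_on` here only shows the wake-up path.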

00:18:33.514 --> 00:18:35.764
Which is a bonkers thing to think about.

00:18:35.814 --> 00:18:40.014
But I like turning weird, abstracted-away
things that you normally don't

00:18:40.024 --> 00:18:43.454
handle yourself in user space, and
doing them in async because it lets

00:18:43.454 --> 00:18:44.864
you address the hardware itself.

00:18:45.284 --> 00:18:48.594
And if you were really copying a big
chunk of memory, it would free up your

00:18:48.594 --> 00:18:50.354
CPU core to go off and do something.

00:18:50.634 --> 00:18:52.564
I don't know how this would
work in a kernel and the kernel

00:18:52.564 --> 00:18:53.584
might have something like this.

00:18:53.584 --> 00:18:56.234
I mean, the kernel probably is doing
something like this under the hood.

00:18:56.454 --> 00:18:59.054
It's not in Rust async, but it
probably has some event-driven

00:18:59.054 --> 00:19:02.644
memcpys where it goes, "I'm not going
to sit around and copy a gigabyte

00:19:02.644 --> 00:19:04.294
of video file from here to there.

00:19:04.754 --> 00:19:07.724
I'm going to ask the hardware
to do it," so the kernel can

00:19:07.724 --> 00:19:08.854
go off and do something, but-

00:19:09.014 --> 00:19:10.094
<v Amos Wenger>That seems likely.

00:19:10.194 --> 00:19:13.304
<v James Munns>It's fun to think of it
as a user space async function instead.

00:19:13.484 --> 00:19:16.414
<v Amos Wenger>It feels like most of
your presentations are about, "You

00:19:16.414 --> 00:19:17.934
thought of this thing as blocking?

00:19:17.944 --> 00:19:19.684
Well, I'm going to tell
you how to make it async!"

00:19:20.218 --> 00:19:23.058
<v James Munns>This is one of those weird
things of peering through all the layers.

00:19:23.128 --> 00:19:27.808
Because microcontroller firmware is
essentially the same thing as a kernel.

00:19:27.858 --> 00:19:30.618
Like a kernel is just
managing the hardware for you.

00:19:31.068 --> 00:19:34.998
And a microcontroller project is
really just: I'm managing all the

00:19:34.998 --> 00:19:37.918
firmware, but then also doing some
business logic directly on top

00:19:37.918 --> 00:19:39.558
of it, but it's baked together.

00:19:39.958 --> 00:19:44.738
But kind of zooming out through embedded
systems to designing an operating system

00:19:44.768 --> 00:19:48.418
to user space is always fun because
you go, "Where are all the lies?"

00:19:48.628 --> 00:19:52.138
Because all the layers, it's one of
those things like: well, all models

00:19:52.188 --> 00:19:54.068
are lies, but some of them are useful.

00:19:54.098 --> 00:19:57.868
And in computing, we've just built
up stacks and stacks of stacks of

00:19:57.868 --> 00:20:00.098
these useful models and useful lies.

00:20:00.178 --> 00:20:02.668
The hardware lies to
you at multiple layers.

00:20:02.698 --> 00:20:07.018
This is what virtual memory, TLBs,
caches, and things like that are:

00:20:07.018 --> 00:20:08.288
the hardware lying to you.

00:20:08.718 --> 00:20:10.778
And then that talks to the kernel.

00:20:11.323 --> 00:20:14.233
And the kernel lies to you in a
bunch of ways of abstracting away

00:20:14.243 --> 00:20:15.573
virtual memory and things like that.

00:20:15.843 --> 00:20:19.243
And then, one layer out, there's what
you can actually look at in user space

00:20:19.243 --> 00:20:21.433
and what it means to block or not block.

00:20:21.776 --> 00:20:25.876
I really do love just punching
holes through that stack and asking: how

00:20:25.876 --> 00:20:31.376
could you express the actualities of the
hardware at a syntax layer of a language

00:20:31.806 --> 00:20:35.646
if you had to pretend that all those
layers of lies didn't exist anymore.

00:20:36.466 --> 00:20:38.766
<v Amos Wenger>Can I just, can I
close with a little anecdote I

00:20:38.786 --> 00:20:40.496
love about implementing languages?

00:20:40.746 --> 00:20:41.106
<v James Munns>Mm hmm.

00:20:41.616 --> 00:20:46.556
<v Amos Wenger>So you showed, you know, on
a very visual medium, you showed a slide

00:20:47.036 --> 00:20:50.946
that showed memcpy calling copy_to_slice
or something, and you mentioned that

00:20:50.946 --> 00:20:52.206
this would probably be recursive.

00:20:52.476 --> 00:20:56.376
But did you know that when you're making
a libc and you're using a C compiler,

00:20:56.876 --> 00:21:00.396
when you feed it something that looks
like memcpy, it will replace it with

00:21:00.396 --> 00:21:03.886
a memcpy intrinsic, which is a problem
if you're writing the function that

00:21:04.326 --> 00:21:06.196
ends up being called by that intrinsic.

00:21:06.606 --> 00:21:09.086
So there are some cases.

00:21:09.191 --> 00:21:12.511
<v James Munns>You run into this all the
time in embedded systems, because a

00:21:12.511 --> 00:21:16.492
lot of memcpys provided by compilers
are optimized for speed, but on your

00:21:16.492 --> 00:21:20.002
microcontroller, you might want to
optimize for size, and optimizers

00:21:20.022 --> 00:21:21.542
love turning things into memcpy.

00:21:21.552 --> 00:21:25.342
Like, that's an optimizer's favorite
thing to do, is go, "That's a memcpy,"

00:21:25.532 --> 00:21:28.752
because you hope that memcpy is one
of the most optimized primitives

00:21:28.752 --> 00:21:31.942
that you could ever have, but yeah,
trying to get compilers to, one, not

00:21:31.942 --> 00:21:36.632
turn things into memcpys, and two,
telling them, "Hey, here is the memcpy.

00:21:36.652 --> 00:21:37.682
I'm bringing it myself."
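
NOTE
One way to keep an optimizer from turning a copy loop into a memcpy call, sketched in Rust.
The function name bytewise_copy is hypothetical and not from the episode. The assumption
is that volatile reads and writes, which the compiler may not merge or elide, also keep
the loop from being collapsed into a memcpy. It trades speed for that property, so treat
it as an illustration rather than a recommended memcpy implementation.
```rust
use core::ptr;
//
// Hypothetical helper: copy len bytes from src to dst, one byte at a time.
// Safety: src and dst must each be valid for len bytes and must not overlap.
pub unsafe fn bytewise_copy(dst: *mut u8, src: *const u8, len: usize) {
    let mut i = 0;
    while i < len {
        unsafe {
            // Each volatile access has to be performed as written.
            let byte = ptr::read_volatile(src.add(i));
            ptr::write_volatile(dst.add(i), byte);
        }
        i += 1;
    }
}
```
A firmware project that brings its own memcpy could use a loop like this inside it, so the
body does not end up calling back into the symbol it is defining.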

00:21:37.882 --> 00:21:38.122
Yeah.

00:21:38.142 --> 00:21:42.182
That happens all the time, where
the linker just explodes

00:21:42.182 --> 00:21:43.242
because it goes, "Wait a minute.

00:21:43.242 --> 00:21:43.542
What?"

00:21:50.014 --> 00:21:52.134
Today's episode was brought
to you by the Embedded Working

00:21:52.134 --> 00:21:53.104
Group Community Microsurvey.

00:21:53.985 --> 00:21:56.935
The Rust Embedded Working Group is
running a community survey to learn more

00:21:56.935 --> 00:22:00.675
about the people using Embedded Rust for
hobby, university, and production usage.

00:22:01.282 --> 00:22:03.872
The survey is anonymous, should
take less than five minutes, and

00:22:03.872 --> 00:22:05.532
your response helps us out a ton.

00:22:06.537 --> 00:22:09.917
The survey will be available
until September 19th, 2024.

00:22:10.229 --> 00:22:12.869
Check out the description of
this episode or our show notes at

00:22:12.869 --> 00:22:16.169
sdr-podcast.com for a link to the survey.

00:22:16.439 --> 00:22:17.329
Thanks for filling it out!