Frame Synchronization

RSS podcast badge Spotify podcast badge Apple podcast badge YouTube podcast badge

Just simple enough to cause problems

An overview of how devices decide how to split streams of bits and bytes into frames, and the things that can go wrong

View the presentation

Video

Audio

Download as M4A

Show Notes

Transcript

Frame Synchronization

James Munns: This is one of those things that like- that valley of it seems simple enough that everyone does it and everyone gets all the little subtle things wrong and sometimes don't even know the right name of terms, so... today I'm talking about frame synchronization.

Which is a big fancy word that is step one of make computers talk to computers. And it actually is a fairly straightforward thing, but the problem really boils down to:

How do we decide where one message ends and the next message starts? Which if you live in a world of TCP or UDP sometimes you feel like this is very simple, but even in TCP and when I'm talking about embedded stuff, usually things like serial ports, it gets a little trickier.

Amos Wenger: I already have a story. I'm sorry to interrupt. But I remember I did some work regarding diffing and patching when I worked at itch.io, the game marketplace for indies. And it's so fun because if you read papers from people working on string sorting algorithms, which are mostly used in biocomputer science-

James Munns: Like DNA sequencing patterns and stuff?

Amos Wenger: So what they do is they're like, "Well, you see, you have an alphabet and then you just make up, like, maybe your alphabet is like 256 values and then you make up a 257th symbol and that's your, your stopper. That's the end value. If you encounter 257th symbol, then you know the sequence is over.

And you're like, "But that's... that's not how bytes work. That works in the paper, but not in my code!"

James Munns: This is exactly what we're gonna get into. So, we're talking about usually a sender and a receiver. And a lot of times it ends up being bidirectional, but we can look at just one of those first. And the metaphor, which is totally... I wanted to make it equally out of date for everyone. So we're gonna be talking about telegrams. Or telegraphs, actually. So you have, you know, an office in one city, and then a wire that goes to another city, and you've got someone with Morse code on that. But these are people who don't have cell phones, and they don't have a landline that they can call each other. They literally only have the wire that goes from the one office in Austin to the other office in Chicago, or whatever cities you would like to use for the metaphor.

But they only have the one wire in between them, and they've got Morse code or something so they can decode symbols from each other. But they want to act like, sort of like a post office. People want to walk in off the street and say, "I want to send a message to my buddy in Chicago. Here it is on a piece of paper. Please get it there."

And we have to figure out how these two people operate with each other and negotiate with each other.

And when something goes wrong, if one of them gets up for coffee and they come back and they sit down and they missed the first half of the message, how do they figure out how to get unconfused with hopefully the least collateral damage possible?

Like, okay, at least I only lost one message with my unplanned coffee break. And when we're trying to come up with this, like you're saying, like in those papers, how do we decide which of these solutions are good and which ones aren't good?

Like which ones sound good, but in the real world are practically- like in your example, where I have a 257th value. Okay. Now, well, every symbol has to be two bytes now, which means we've just reduced our total efficiency by half. Is that good? Or is that maybe not so good?

What's good? Efficiency, robustness & simplicity

James Munns: So when we talk about good, there's efficiency- so like, how efficient are we at using all the values we can send over the wire?

There's our robustness, so what happens if a message gets lost? Like we walk out of the room or someone cuts the wire or the connection 's bad.

What if the data is corrupted? So like, what if we get some weird radio noise and it makes it sound like a message is something else.

Or desynchronization. This is sort of that, like: you walk out of the room and you walk back in halfway through and you go, "Wait, I don't know if... how many did I lose? Am I halfway through a message? When's the end of this one?" How do I get back on track?

And simplicity. Like the least complexity we can get away with is usually great because there's less places for things to go wrong. You can have like a very complex setup but it means that now you have to think about all this failure modes and this tree of failures that I could possibly have, and deal with that.

James Munns: So we're back on our telegraph. We have one sender and one receiver.

And most importantly, they don't control the content of the messages that they send. People are going to walk in off the street and want to send whatever they want to send. And you can't really trust the people walking in off the street aren't trying to mess with you just because that's what they find fun. Because people do that for... fun.

How to "frame" messages?

James Munns: The only thing we can control is how we as telegraph operators have agreed to send these messages back and forth- the framing, how do we package these messages and encode them and decode them over the wire.

You and I can decide on that because we went to a convention somewhere at some point and decided and wrote down these rules on a piece of paper but we can't make changes to people's messages to just fit how we're feeling today.

So, normally there's two main ways that you can do this. You can either do what's called "in band" signaling or "out of band" signaling. Out of band signaling could be something like: we have two wires, and only you and I are allowed to use that second wire. We use that wire to just beep every single time we're done. Or, we do something that's illegal. Like, we hold down the button for way longer than we should. Like, someone can't walk in and tell me to hold down the button for ten seconds.

But I could decide to hold down the button in a way that people off the street aren't allowed to do. And these go back to those efficiencies of like: well, now I have to run two wires or now I have to spend a bunch of time holding that down.

And that gets into our trade offs. But these are things that are illegal for the message to have in them, but are legal for you and I, as the operators, to decide on. This is like the ethernet frame does something that you're not allowed to have in the payload of the ethernet value or something like that.

So these are like, common solutions that I see and how they're actually not very good. But if you haven't thought of how all of the ways that they can fail, how you might think they are good, but they are perhaps not so good.

Line silence

James Munns: So one of them is just line silence or holding down the button for a certain amount of time. You say, "Look, between every message, we're going to sit quietly for five seconds. And I promise that while I'm tapping out that message to you in Morse code, I'm good enough at Morse code that I'm never gonna pause for more than five seconds." And if you hear a silence of five seconds, you go, "Rip off that piece of paper, that message is done, get the next piece of paper, I am ready for the next one."

And this is really great because it's simple. You don't have to think that much. You go: if it's quiet for five seconds, I have a little stopwatch or something. If it's quiet for five seconds, we know it's done.

And it's not something that someone can affect. You can't write in your message, "Please pause for five seconds." Cause I go, "No, you can't ask me that. You can give me the data that I'm going to send. You can't tell me how to send it."

Now it's, you know, somewhat efficient? You know, it's not that long, depending on how long you need it to be, it could be less efficient, but it's fairly robust.

Like it's something that we don't have to worry about:** as long as you promise that the sender can never get distracted.**

If someone walks in and talks to me while I'm typing out this message, can I ever be distracted for more than five seconds? Because if I can, then we have to increase that time for it to be suitable.

This sounds weird for a telegraph operator, but if we're talking about a microcontroller or an embedded system: if we have hardware interrupts or someone presses a button or it's time to draw the screen, how do I guarantee that I'm never distracted from receiving messages for more than the amount of time?

And do I actually have a stopwatch that I can accurately measure that time so that if I'm counting to five and I only count to 4.9 because I got a little distracted and you think you've ended a message and I don't think you've ended a message, we can get out of sync. So we do need maybe a longer amount of time than you might think, just so that if something does go wrong or weird, or I get distracted, it's not too little of time.

So this one's maybe okay, but also sometimes it requires some hardware capability that is more challenging than you might think to do accurately.

Amos Wenger: Right. And to make it more reliable, you have to sacrifice efficiency.

James Munns: You make it longer so that you go, "Well, I promise I will never be distracted for more than 200 milliseconds."

Amos Wenger: Yeah. Yeah. It's easier to imagine

with smaller values 'cause like yeah. The telegraph example shows its limits here. But if you're thinking of like, it's, I dunno, it's four milliseconds or something. Yeah. I can see them getting stuck for that long.

James Munns: Yeah.

Length header

James Munns: So the other one, and this is the one that I see most common when people roll their own is they use a length header.

So I might say, "Look, I'm going to count the number of words in this message. And before I send the message, I'm going to send the number of words," and that way you can count how many words you receive and you go: cool! He said 200, and then I count my messages, and when I get to the 200th word I rip the piece of paper off the sheet and I know that I am done and now I start listening for the next one.

But let's say I went to the bathroom and I come back and we're in the middle of receiving a message. You are sending me a message about how many horses are being sent on the train to Chicago today or whatever. And I go "Shoot, shoot, shoot, shoot. Uh, cool. I need to get back. I need to get back... uh, let me just listen for a number. I'm- I'm back I was distracted, but i'm listening for a number," and you're talking about sending 200 horses. And it's the last sentence of the message and I go, "Oo! Oh, they said 200!" So I write down 200 and I start counting 200.

There's four words left in the message. So now, I've started counting 200, we get to the four words in your message. Then there's some silence while we're waiting for you to send the next message, you start counting the next one. And I get to the end of the 200, and you're still talking about someone else's message, and I've totally lost this one, I've become desynchronized, and everything is terrible.

So this is one of those things that I see people do when they have TCP with TLS, where you can guarantee that all the messages will be there. You've got an operating system, buffering messages for you and things like that. And if you have TCP and TLS, it's probably fair. If you lost some messages, your TCP connection is just going to fail.

And TLS is going to go, "Oops, we're out of sync." And you'll either recover with TCP or you won't. And the connection will be reset or whatever. But even with TCP with no TLS... there's not that much- like TCP is a reliable protocol. But TCP itself only has a 16 bit CRC on it.

The chances of something going wrong is not impossible. And if that 200 becomes 201 because you have a bit error, all of the sudden now you are totally desynchronized. And I am listening into your next message and all of a sudden it's not a number and I am very confused. Or I could lose a byte or- you know, all these ways that things could go wrong, if you don't have a really rock solid medium like TCP and TLS.

On a serial port, if you get a little glitch and all of a sudden you lose a byte: what do you do? How do you recover?

So this length header is very efficient because it only takes, you know, us putting the length that we're going to put at the front of the message. But once we get out of sync, we're done. And trying to figure out what is a data number, and what is a header number is tremendously hard to figure out if you only have a length header prefix.

Amos Wenger: Which is why you always add a four byte magic number in your header.

James Munns: It's true! And you can reduce the false positives... but there's always a chance that someone could come in and go, "Okay, I'm going to send a message with just a number followed by the magic word over and over and over and over and over and over again, and see how much I can ruin your day."

Amos Wenger: Some people do that on purpose for fun. I think it's called a queen, a quine. What, how's it pronounced? Like a-

James Munns: Oh, well, quines are when you make a program that encodes a different program that eventually wraps itself. I was thinking of phreaking, like, P H phreaking, like phone phreaking. The people who had you know, they found a whistle...

This is in band signaling, like when you have a phone and you play a certain tone that signals the end of the call, or 'stop billing,' and you can just play that little tone in there and all of a sudden the hardware in the central office goes, "Oh, stop billing. Okay!" That's in band signaling, which we don't love.

Amos Wenger: Some people could do it by hand, like with no hardware, no, physical whistle. I was thinking of the files that are like valid PDF, but also a valid PNG and also a Linux executable and also something like, that's very fun. You can- yeah, you could just accidentally, sort of on purpose, use other people's magic numbers because it's fun.

James Munns: Trusting- trusting user-provided data is always spicy and fun.

Amos Wenger: Yes, Amanda, please, please edit in the honk sound from, uh, Untitled Goose Game here.

James Munns: Exactly.

Byte stuffing

James Munns: So, the other one that I see people use, and this is like common back to the dial up days, is using a flag word. So we decide a certain word means 'end of message,' but like in your DNA example, if we're sending bytes, we only have 256 different values. And someone might want to include that value.

Like, if we're sending ASCII, there's probably some ASCII control character we can go: Okay, well, there's literally a character called ETX for end of transmission. Which, if you're just sending ASCII characters and you agree on the rules that someone walking in off the street can only send ASCII characters: cool, you've got all 128 reserved bytes that you're allowed to use- or, excuse me- 128 characters or whatever that you're allowed to use and then half that you're not allowed to use because they have special semantical meaning.

It does mean that our efficiency takes a hit because we're very rarely sending something in that back half, you know, we lose a bit of efficiency, maybe it's fine.

But this all breaks down when we leave the world of plain text and we want to send a photo or a video where the color values that we're actually sending across the network here can be the full range.

Like it can be any value of a byte. And so you might accidentally send a picture that's all brat green, and it turns out that brat green has exactly the same value as ETX or something like that, and you're sending a bitmap across the network that is "Oops, all ETX values."

Amos Wenger: I'm sorry. Is this, is this a technical term? Like brat?? Okay.

James Munns: I don't know!

Amos Wenger: I'm sorry. I didn't keep myself up to date. I just, I'm sorry.

James Munns: Do I look like Pantone to you? But yeah, so, I mean, the normal way that you say it is you go: Okay, well, ETX is our end of transmission word. And if it happens to show up in your message, tell you what, you're going to escape that message. You're going to send the character DLE. Or data something escape before that. And this is kind of like putting backslashes before new lines in a text or something like that.

You say like, "Ignore the next thing that I'm going to send. It's a normal data byte, not a special meta character," or something like that.

But then the problem becomes: well, okay, well, if you're sending that message that's all that escape character, you now have half efficiency because every time you send it you have to send the escape character behind it. And then what do I do if the two data bytes are "Ignore escape character, end of message." And then you have to put more- it's like that problem where you just keep adding more backslashes until it works.

And if you have a malicious attacker, they can blow up, "Okay. Well, you said that I can send a hundred data characters for a dollar," or something like that.

"Cool. The hundred that I pick are the worst nightmare ones for you." And all of a sudden you're sending five, 600 bytes of transfer for 100 bytes of actual data across the wire.

Amos Wenger: Mm hmm.

James Munns: And so this generally works as long as you have, you know, scrambled data or whatever, if it's encrypted or whatever, and you're unlikely to get a run of exactly the same characters all over the place. Sure, you can maybe make it work, but it's either unbounded or an incredibly high bound of potential worst case going on.

And again, if you're on an embedded system and you need enough room to buffer all of the bytes that you're going to receive, you got to think about that because either you have to decode everything one byte at a time so you can do the decoding, or you need to DMA a huge chunk of data. But if your worst case is eight times bigger than your potential data transfer, then that's a lot of wasted space for something that almost never happens. So these are all the ones that I see people commonly use. These are the, "Ah! I know a trick to solve this." And then you end up with people where you go: well, what happens if you reboot one device cause you did a firmware update halfway through a conversation? Now they're out of sync and these two devices can never get to each other or they get accidental buffer overflows because: oops, my decoding used eight times more of the space that I thought I could use.

Amos Wenger: One problem I didn't, I thought about, but I forgot to bring up with the length header is that you don't always know the length of the message you're sending. I imagine in the embedded world, most of the time you do, but I, I'm, I am now thinking about HTTP cause you forced me to. Also, cause I've been working on this implementation and we have, you know, we have chunked transfer encoding and you don't know, you're just sending bits at a time.

So, you know, the length of the chunk, we don't know the length of the total message. And then you have to do something even fancier.

James Munns: Yeah. And these frames are chunks, but it's still a problem of like, you can't opportunistically send things sometimes, if you don't know the size of the chunk, which might be variable and things like that. So it also goes back to efficiency of like: you have to be able to buffer up some amount so that you can count how much you buffered so that you can send it versus a totally asynchronous, like, stream of bytes, for example.

My favorite: COBS

James Munns: So we have enough time to get into my favorite one, which is still from the days of dial up, but more recent dial up, like early 90s or something like that. And it's called COBS, or consistent Overhead Byte Stuffing. Which is actually a funny combination of most of the previous ones, but it allows you to do things in a fairly- like it's a little more complex than any of them, but has almost no totally awful, pessimal case.

So I'm just going to run down some of these rules and we'll get to the point where we have a total rule between the two of us. And now we're back to talking about real world 8 bit bytes instead of, you know, telegraph words.

Amos Wenger: Do I get a guess as to how it works? Knowing that I've never- the, acronym is is familiar. I'm pretty sure I read about this before, but I've already forgotten all about it. Do they just change- like, does the- is there an escape character, but it depends on the position in the stream? So that you can't, like, the pessimal case is never really hit? Cause it changes around?

James Munns: Sort of... but okay, let me explain it. So, first rule: a null byte always means end of message. If you send a zero on the wire, that's the end. That's the, you know, perforated edges, we are now done, tear off the old message, go with the new one. So then you say, well, what if I want to send a lot of zero data bytes?

So the first rule is a zero always means the end of message. And whenever you give me a user message, I'm going to stick a zero at the end of your message. I'm going to append a zero to the end so that I know that that is the end of the message. But now I need to figure out: what do I do with all these zeros inside of the message or the potentially the zeros inside of the message?

Well, what I'm going to do is I'm going to take my zero at the end and I'm going to work backwards. And anytime I run into a zero, I'm going to count how many bytes it's been since the zero, like working backwards, and I'm going to replace the zero with how many bytes until the next zero. And then when I go back to the next user zero, I'm going to be counting again from that new position and I'm going to count how many bytes it would be to the next zero that got replaced.

So essentially what I'm doing is I'm making a linked list of all the zeros in my message. And I work all the way until I get to the front of the message and I put the number there. So let's say I put 14 there. So I will start with 14. I know that's a header byte I'm going to take the next 13 bytes as normal data bytes. And when I get to that 14th byte, it's going to be a number that's not zero. Let's say it's eight. That's where a zero should have been.

So the data I'm going to replace there with a zero, and I'm going to start counting again until I get to eight. And if you ever are counting down like this and you hit a zero, you know that you've reached the real end of the message.

Amos Wenger: And it also means you, you can tell the boundaries of the messages without knowing any of this, just by looking for zeros. Yeah.

James Munns: Yep, before you actually start decoding this, you can just keep buffering until you get a zero and then you go, "Ah, now it is time to do something." And so you go look at the first byte, you walk forward, and essentially this is also a little bit of an integrity check. If you hit the zero when you were counting but that's not where you expected, you know that you've lost something. And it means that you go, "Welp, nope, that was a poorly framed piece of data."

And the nice thing is, is in this case, you have at least two bytes of overhead. You've got the zero you're sticking at the end, and you've got that number that you're sticking at the front. If you sent all zeros, if you were trying to be an ass and sent all zeros here, that's actually perfect for me because then every data by just becomes a one. It just goes link, link, link, link, link, link, link, link, link, link.

The worst case is actually you send no zeros. And this is the sort of final rule. If you ever get to- I forget the number, it's like 254, you don't go all the way to 255- but if you get a 254 and there hasn't been a data zero, you insert one more header byte there that you know that you're going to throw away. It's basically kind of like sticking an extra header byte in the middle of the message.

Amos Wenger: Right, yeah.

James Munns: So, you know, that if you ever see a 254, you go: ah, I'm going to be counting towards the next one, but there's no actual zero data byte that I'm replacing there. It's just a placeholder. Which means your worst case is bounded in the worst case to be two bytes plus one for every 254 data bytes.

Amos Wenger: Right, yeah.

James Munns: Which means effectively for it, like if your max message size is like a kilobyte, your worst case overhead is something like six bytes or something like that. You can see it scales really proportionally to the length. And most of the time you do just randomly have zeros in there. So you don't even hit the worst case very often.

Amos Wenger: Yeah, Yeah, that's... that's so- I like that because it's a trick. It's just someone figured out a nice design and it's it's as old as the world itself almost... like the 70s probably?

James Munns: And it's one of those things where this is something that is used in telecoms, like when you're streaming data between telecoms and things like that, and you're trying to do message framing of where does one message end and when does one start.

This is something that's fairly simple, which means you could do it in electrical logic or whatever. And there's been a lot of papers on this, of like: how do we do message framing? Where people come up with these incredibly complex encoding schemes and things like that and the baseline usually is COBS because you go "Well, it's not perfect," but to be honest, it's dumb as hell and it's really easy to do in either hardware or software. And you get some of these robustness checks for free. Like, you know, I can tell if I have a properly terminated message, I have a very clear way to say if I get a zero we're done here

Amos Wenger: Yeah, yeah. I was thinking like, I now better understand the slide where you said, "How do we get unconfused?" Because this is a nice way in COBS is like, if you don't know what the heck is going on, just wait for a zero and try again with the next frame.

James Munns: Exactly! Yeah, and if you ever get confused if you're halfway through a message, and you go, "... That didn't line up," like something just went wrong or I had an error locally, or I walked in from the bathroom, you go, "That message is gone. Sorry. Sucks. But now I know my job is to sit here and wait for the zero."

And then we can at least move on from there.

Amos Wenger: Exactly. But you don't have to like, tell the peer: hey, can you... I don't know, like something went wrong. You just need to keep listening and ignore until you reach another, another zero.

James Munns: And that's your resynchronization point. It means that you have this perfect resynchronization point, which is a zero on the data.

So useful, I use it in Postcard

James Munns: This is in fact so useful, I use it in Postcard and Postcard has a built in flavor for this. So as you're serializing, you can do the COBS encoding at the same time, or as you're deserializing you can do the COBS decoding at the same time. Where essentially as you're pulling bytes from the wire or pushing bytes to the wire, you can just do this encoding or decoding because it requires like... a couple of lines of logic and like one or two bytes of intermediate context to store.

But it's basically stupid simple and you can do it for almost free as you're going.

Amos Wenger: God, I like this trick so much.

James Munns: So if people are building it themselves and you've got a serial port from here to there and you have to do it and you don't have a reliable line, like TCP and TLS: probably use COBS with something like Postcard so that you can just get the encoding and decoding of frames for free and try and think of what your hardware can support to figure out how to hardware accelerate it.

And we have talked about DMA. And so this is one of those fun things where you can just say: hardware, give me messages until the line goes quiet for a little bit. Okay. Now I'm going to look through that, see if I got a zero. Pull that out, save the remainder off, and then wait until the next time my hardware pings me for something to be ready.

So, it is one of those things that it's very easy to automate at this point. Which is why it's one of my favorite. And I have to say thanks to Whitequark, who showed me, because I was trying to come up with something dumb and complicated, and Whitequark went, "Have you heard of this thing from the dial up days?"

and I, I went, "No..." and it became my new favorite thing, and it's, it's just become one of my go-to tools now.

Episode Sponsor

The Self-Directed Research podcast is made possible by our sponsors. We offer 30 second host-read ad at the end of every episode. Not sure how to get your message out, or what to say? Let us help!

If you'd like to promote your company, project, conference, or open job positions to an audience interested in programming and technical deep dives, send us an email to contact@sdr-podcast.com for more information about sponsorship.

Thanks to all of our sponsors for their support.