Self-describing formats, but at what cost?
An exploration of self-describing vs non-self-describing formats, and how it changes the shape of your programs more than you might think
Show Notes
Episode Sponsor: Descript
- James' post on cohost "What good is partial understanding", and RIP cohost
- Postcard-RPC
- RFC 3339 or on Wikipedia
- Hyrum's Law, spacebar heating xkcd, quotation marks
- Columnar storage formats, Apache Arrow, Apache Parquet
- JSON, ProtoBuf (Protocol Buffers), CBOR (Concise Binary Object Representation), ASN.1 (Abstract Syntax Notation One)
- MessagePack, Cap'n Proto, rkyv, abomonation
- Go, "No true Scotsman", "Here be dragons", SemVer compatibility
- Postcard
- cargo-semver-checks
- Patreon API, RIP Patreon API
Transcript
What good is partial understanding?
James Munns: This week I want to talk about: what good is partial understanding. And this is actually sort of a redux of a post I wrote on cohost, RIP cohost...
Amos Wenger: RIP cohost.
James Munns: When I was figuring out a lot of things that ended up becoming Postcard-RPC and some other stuff that I'm working on. But when it comes to two machines, two programs communicating with each other I was trying to figure out what benefit partially understanding messages actually got you, and whether it was a reasonable thing to try and do.
Amos Wenger: So it's, it's still a technical presentation?
James Munns: Yeah...
Amos Wenger: Because we talked about doing like talking about other topics at some point. I, that's something I'd like to do, personally in my career, but it's like- partial understanding... I was like, "Okay, we're going to the human sciences now? What are we, what are we doing?"
James Munns: No, I'm extremely stuck on machine to machine communication-
Amos Wenger: I can see!
James Munns: It absorbs all of my idle thoughts.
Amos Wenger: Yes, alright, I'm listening.
James Munns: But this actually has no code and you don't have to know any encoding formats and we're going to use... well, we'll get there.
But let's say you ask me what time it is right now.
We are two computers-
Amos Wenger: James, what time is it?
James Munns: We are two.. uh, out of order, reset, reset.
We're two computers, we're programs written around the same time with a common understanding of things. And you asked me what time it is right now.
How do we talk to each other?
James Munns: Today, I might just say, "11 04 27" that's my whole response. You asked me what time it is. This is how we've agreed to talk to each other.
We've agreed to say: hour hour, minute minute, second second. This is our common understanding of how we will uh, communicate with each other.
We know generally that our hours are zero to 24, not inclusive. So we're using a 24 hour time cycle.
We're using 60 minutes in an hour, 60 seconds in a minute, those kinds of things. We know generally what's expected to be in these messages. It's going to be those three sets of digits.
We're both programs, we are written around the same time. You're off running in production, and I've decided to improve myself, or my programmer has decided to improve me unilaterally.
You're off running, you don't get a chance to be recompiled or reprogrammed or whatever, but someone decides that hour hour, minute minute, second second is insufficient for what we would like to be doing, so I'm going to decide as the sender of this what I would like the format to look like. And I've decided, hey, let's get closer to a good standard, RFC 3339, and I'm going to send year, month, date, hour, minute, second, sub second, millisecond.
More information is better, right??
James Munns: So 2024, 10, 17, 11, 04, 27, 014. A very reasonable thing to do. I've decided to give you more information. I go, "How could anyone be upset at me for sending this additional information? I'm just giving you more options!"
But if you were a program that was written, assuming the only thing you would ever receive is three integers in the range format that you expect, those kind of things, you would be very confused because you are not a human, you are a computer, and computers really can only do what we've told them to do, and if we decide to say, this is what you're going to get, then you're going to be very confused when you go to read an hour that is four digits and starts with 2024, because you go: that's, that's not a very good hour.
You might be very confused...
James Munns: There's various failure modes that could happen here. You could just say, "I got a bad response." You could parse that first 20 as: it's 8 PM. And then you could try and parse 14 as the minutes- you know, there's a lot of failure modes we could do from totally rejecting it, which you should probably do.
And then totally misinterpreting it.
Amos Wenger: I was going to say the only reasonable option here is to completely reject it. But then I remembered about parseInt in browsers, which would definitely take like 2024 and be like: okay, that's a number. And then encounter a space and be like: that's no longer a number. Let's just return 2024. I think that's what it does. I'm pretty sure.
James Munns: Yep, and then once you mask that down, you either overflow or you just get a random number that you don't expect, but things are not going well in our communication today.
And that's because this format that I've described- you know, hour hour, minute minute, second second- is not self describing. The message doesn't tell us what it is.
It just assumes that we've pre negotiated what a reasonable set of things are, and if you know, you know, and if you don't, then you're out of luck.
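As a rough illustration of those two failure modes, here is a minimal Rust sketch. Both functions are hypothetical helpers written for this example, not code from the episode or from any library. `parse_hms` is a strict positional decoder that rejects anything other than three in-range numbers, while `parse_hms_sloppy` reads fixed-width digit pairs and never checks what follows.

```rust
/// Strict positional (non-self-describing) decoder: expects exactly "HH MM SS".
/// Hypothetical helper for illustration only.
fn parse_hms(msg: &str) -> Option<(u8, u8, u8)> {
    let mut parts = msg.split_whitespace();
    let h: u8 = parts.next()?.parse().ok()?;
    let m: u8 = parts.next()?.parse().ok()?;
    let s: u8 = parts.next()?.parse().ok()?;
    // Reject trailing fields and out-of-range values instead of guessing.
    if parts.next().is_some() || h >= 24 || m >= 60 || s >= 60 {
        return None;
    }
    Some((h, m, s))
}

/// Sloppy decoder: takes the first six digits as three two-digit numbers
/// and ignores everything else. This is the "misinterpret it" failure mode.
fn parse_hms_sloppy(msg: &str) -> Option<(u8, u8, u8)> {
    let digits: Vec<u8> = msg
        .bytes()
        .filter(|b| b.is_ascii_digit())
        .map(|b| b - b'0')
        .collect();
    if digits.len() < 6 {
        return None;
    }
    Some((
        digits[0] * 10 + digits[1],
        digits[2] * 10 + digits[3],
        digits[4] * 10 + digits[5],
    ))
}
```

With the upgraded message "2024 10 17 11 04 27 014", the strict decoder returns `None`, while the sloppy one happily reports `(20, 24, 10)`, which is the "it's 8 PM" misreading.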
So we say, okay, you know what? We've broken someone's code. We updated our time server. We thought it was going to be lovely. Our users are now incredibly upset because all of a sudden their things are breaking because they didn't know how to understand that.
Fine, let's make it self describing
James Munns: We go, you know what? We're going to fix this by using a self describing format. We're going to include enough information into the message that you can figure it out, even if we change things over time. We make it forward compatible or, or whatever you want to describe it as.
So we take our 11 04 27
and we add a suffix on each of those numbers.
So we say 11h 04m 27s. And as a human, you read this and you go: okay, cool. That's hours, minutes, and seconds. I can look at that and figure out which one's marked hour, which one's marked minutes, which one's marked seconds. And now all of a sudden our machines can figure it out too. It's readable to a human still, and it's still readable to a machine.
So we now semantically are thinking about this. I've colored each of these so we can see what the machine maybe parses this as, as when we look at it. We've got hours minutes and seconds as a separate concept that our computer is thinking in.
Cost of doing business
James Munns: The first thing we notice is that our messages are now bigger. I had to add a delimiter or I had to add some extra information that allows you to recover the shape of the message.
So instead of just sending my three integers, I'm sending a character or some kind of field delimiter in my messages.
But you know what? We understand things. And maybe that's the cost of doing business. We were paying a little overhead, but what we gain in flexibility surely makes it worth it.
So then I go back to making the upgrade that I always planned to do.
And now I send "2024y 10M 17d 11h 04m 27s 014i." And this is a lot more than the system was asking for when it just wanted the time. But because we have this self describing format ability, we can ignore all the extra suffixes that we don't understand. And we find the three things that we care about.
We care about the H, the M and the S. We've successfully received this message, even though- like your high school word problems, you've got a lot of extra information and you had to figure out which ones you didn't care about, but we have solved this word problem even as a computer and not just as a human.
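A minimal Rust sketch of that idea, assuming tokens are separated by whitespace and each token is a number followed by a single suffix character. The `Hms` struct and both functions are hypothetical names made up for this example.

```rust
#[derive(Debug, PartialEq)]
struct Hms {
    h: u32,
    m: u32,
    s: u32,
}

/// Splits a token such as "04m" into its number and its one-character suffix.
fn split_token(tok: &str) -> Option<(u32, char)> {
    let tag = tok.chars().last()?;
    let digits = &tok[..tok.len() - tag.len_utf8()];
    Some((digits.parse().ok()?, tag))
}

/// Picks out the 'h', 'm' and 's' fields and skips every suffix it does not know.
fn parse_tagged_hms(msg: &str) -> Option<Hms> {
    let (mut h, mut m, mut s) = (None, None, None);
    for tok in msg.split_whitespace() {
        let (value, tag) = split_token(tok)?;
        match tag {
            'h' => h = Some(value),
            'm' => m = Some(value),
            's' => s = Some(value),
            _ => {} // unknown suffix ('y', 'M', 'd', 'i', ...): ignore it
        }
    }
    // All three fields we care about must be present.
    Some(Hms { h: h?, m: m?, s: s? })
}
```

Fed "2024y 10M 17d 11h 04m 27s 014i", this returns `Hms { h: 11, m: 4, s: 27 }`. The old untagged "11 04 27" no longer decodes at all, because none of its tokens carry a suffix the receiver recognises.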
Amos Wenger: Is it bad that looking at this I'm immediately thinking of: Oh, this should be like length prefixed or something. Also: what if something's not- Or like, do you first split on space and then that's how you know? Like you get the last character and so- what if you have more than 256 or whatever number of ASCII characters, different field types, like all of those things are immediately springing into my mind and that's why I charge the hourly rate that I do.
It's because I'm a senior engineer, baby.
James Munns: I was going to say: you are a person who has been burned by protocol design before, which I think is how we- our collective trauma is how we've ended up here.
Amos Wenger: But I'm assuming you're building up to that.
James Munns: Oh, we'll get there.
So this is neat. This is, this is what we wanted, right? We want to be able to change stuff. We have flexibility, we admit that our servers and clients are not written by the same people, and that those people have different wants and different needs, but we need to make something work.
So cool. We've got our extra overhead. We're processing all these extra fields that we don't care about, but whatever, overhead is overhead, it works. Who cares? Computers are cheap.
What happens when info is shuffled?
James Munns: So something has changed. I'm having, you know, a reckless day.
I've decided to switch from the ordered messages that I was sending before. And now I represent these internally as a hash map, which means all my fields get shuffled depending on whatever my hash seed is for the day. And I send you 27s 11h 04m. This is not hours, minutes, seconds, like we talked about, but the suffixes are still there.
So when we go to parse this, we say the S is the seconds, the H is the hours, and the M is the minutes. Even though this was out of order, this self describing message format got us the ability to recover from that. We say we know which field is which, even if they didn't show up exactly how we expected them to.
Amos Wenger: I get a chance to talk about, I think it's Hyrum's rule? Which is that everything that's noticeable, even if it's not documented, ends up being part of a public interface, whether you like it or not. So I am also immediately thinking of the case where people assumed: Oh, if we get like "number number h" then it's going to be followed by m and then s.
So they didn't actually check the suffixes. That is, if they encounter h, they just keep reading that exact number of characters and interpret them as minutes and seconds. And then this would completely break if you actually changed the order.
But yeah, that's, I think, I think it's Hyrum's rule: you have not documented this anywhere, but people have been using your API for a while. They've noticed some patterns and they've started relying on them. And now you have to uphold that even if that wasn't part of your plan.
James Munns: Yeah, this is the spacebar heating xkcd. But yeah, this is also "pains of why you might not want to design your own encoding format and home roll your own parsers or serializers because edge cases are a thing."
This is wonderful because I, even though you shuffled things for me, I was still able to understand it because there was enough information in the message itself that told me how to think about this message. A bit of the schema of the message was within itself. I know what fields are what, even if they don't show up in the same order.
The downside is, I can't understand this message in one pass. Like you were saying with the home-rolled parser, if I wanted to be very clever and efficient, I might go: take an integer, take an integer, take an integer. You know, I can do that linearly in one pass. I've convinced myself that it is very optimized and a good thing to do... but you can't do that.
James Munns: If we admit that our format is allowed to have out of order identifiers, because we didn't say they always appear in this order. We just said: if there are hours, they will appear with an H. If there are minutes, they will appear with this. And so when we get bonus data, great, we can skip those. And when things are out of order, we can recover from that.
Amos Wenger: Bonus data is forever seared in my brain as like... memory corruption in C.
James Munns: Yeah, yeah, yeah.
Amos Wenger: It's the thing you don't get in Rust, but in C, if you accidentally read past the end of a buffer, which happens a lot, then you get bonus data.
James Munns: So much bonus data. We can see that we can't parse this in one thing, because if we were going: hey, what are the hours? We have to go to the middle of the message. If we go: what are the minutes? We have to go to the end of the message, or what are the seconds? We have to go to the beginning of the message.
So we couldn't do this in one pass. Maybe we could turn it into a hash map that we can query, but again, we have to do sort of out of order grabs from this. There's no more just, grab the next integer, grab the next integer, grab the next integer.
From decoding to querying
James Munns: And this is an important step because we've gone from decoding that message to querying that message. This has gone from something that we are just slurping the bytes and transforming them into our internal format, to: we now have this object that we have to ask questions.
Do you have this? Do you have this? Do you have this? And that's not bad. This is a common thing, but it's important to note that even though these messages seem very similar, our mode of interacting with them has changed very, very dramatically to go to this self describing format.
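The shift from decoding to querying can be sketched in Rust like this. `decode_fields` is a hypothetical function for this example: it turns the tagged message into a `HashMap`, and the caller then has to ask for each field by key instead of reading values in a fixed order.

```rust
use std::collections::HashMap;

/// Turns a tagged message such as "27s 11h 04m" into a queryable map.
/// Hypothetical helper: assumes whitespace-separated "<number><suffix>" tokens.
fn decode_fields(msg: &str) -> Option<HashMap<char, u32>> {
    let mut fields = HashMap::new();
    for tok in msg.split_whitespace() {
        let tag = tok.chars().last()?;
        let number: u32 = tok[..tok.len() - tag.len_utf8()].parse().ok()?;
        fields.insert(tag, number);
    }
    Some(fields)
}
```

After `decode_fields("27s 11h 04m")`, calling `fields.get(&'h')` yields `Some(&11)` regardless of where the hours appeared in the message. The price is that every access is now a lookup that returns an `Option`, so each one can come back empty.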
And then what if I just send 11h04m? Maybe I have a weird thing in my code, where when the seconds are zero, I just null them out for some reason because I'm trying to be clever or something like that. But the receiver of this message doesn't know that convention. They go to query hours, minutes, and seconds out of this message they've received, and they get an error.
They say, "There are no seconds field," and then you have to decide. What do I do in that case? Is that just: Oh, well, I just returned the error because my query failed. Is that: Oh, when there's no seconds, I default to zero... but then how do I know that zero is more reasonable than 30 seconds or 59 seconds.
Maybe as a human, you could go: whatever, if I just say 1104, I'll assume that it's a rounded number or zero seconds or something like that. But computers don't guess unless we've told them to guess.
Amos Wenger: Or, if the format you're using is based on a language that insists that zero values are always meaningful and good, and every field is actually optional. If you know, you know.
Queries can fail
James Munns: Now we have to admit that queries can fail. It's not just the deserialization failed because we received a well formed message, but our information that we wanted to pull from it failed. So it was a good message, but not what we needed. You mentioned this about types before.
What if I send the word "eleven" in quotes, the word "four" in quotes and the word "twelve" in quotes, and I still suffix them with H, M and S.
This is a reasonable way of formatting this. And if whatever wire format you're using has the ability to have both numbers and strings in the messages... what's wrong with this? I said I was going to send hours, minutes and seconds with a suffix. We maybe never agreed that numbers were the only way that we were going to do this.
Amos Wenger: Just the large air quotes while you were saying, "It's a 'reasonable' way to format this."
James Munns: Yeah. I'm making a bit of a straw man here, but...
Amos Wenger: Yes, well, it's so wasteful to just actually use, well, JSON... uh, actually is double quotes, but also the hilarious part is that I believe on your slides, they're not even actually double quotes. They're smart quotes. So that makes it even less reasonable.
James Munns: Typing on a Mac.
Amos Wenger: Yeah.
James Munns: But now we have to admit that types are part of our queries too. We're not just saying the H, the M, and the S are important to us. We are saying we are specifically querying: I want an integer with the H suffix, I want an integer with the M suffix, and I want an integer with the S suffix. So now not only are our queries based on the key or the specific tag that we're using, but also the type of the message, if we're using a format that has more types than just numbers.
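A small Rust sketch of a query that depends on both the key and the type. The `Value` and `QueryError` enums and the `get_int` function are hypothetical, made up for this example. They model a format whose fields can hold either a number or a string.

```rust
use std::collections::HashMap;

/// The two kinds of value this toy format can carry.
#[derive(Debug, PartialEq)]
enum Value {
    Int(u32),
    Text(String),
}

/// The two ways a query can fail on a well-formed message.
#[derive(Debug, PartialEq)]
enum QueryError {
    Missing(char),
    WrongType(char),
}

/// Asks for "an integer stored under this key", not just "this key".
fn get_int(fields: &HashMap<char, Value>, key: char) -> Result<u32, QueryError> {
    match fields.get(&key) {
        Some(Value::Int(n)) => Ok(*n),
        Some(Value::Text(_)) => Err(QueryError::WrongType(key)),
        None => Err(QueryError::Missing(key)),
    }
}
```

If the sender stores `Value::Text("eleven".into())` under `'h'`, then `get_int(&fields, 'h')` fails with `WrongType('h')` even though the message parsed fine and the key exists.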
"Good" messages but still insufficient
James Munns: And this is a tricky thing to realize when we're writing our client code is that messages can be well formed, parsable, and queryable, and still be insufficient for our needs. And this means that we need to handle errors at every single one of those steps.
Did we receive a message? Did we receive a message that can be formatted in this format that we've decided on? Is it a message that has fields that we expect in there? Are the fields of a form that we expect them in? And we just said: well, we'll make this self describing, you know, the overhead is just the three characters that we're adding as tags, or sometimes we have to add bonus data and things like that.
But when you switch to a format that has this level of flexibility, you have to figure out what our response is when we hit all of those flexibility cases. And this is one of those things that I think people don't always realize the full extent of cost. I think Rust makes it a little bit easier because when you access something like a JSON message, you have to call the get APIs on it and it might return None or the null representation.
So in Rust, you're probably having a match statement that's exhaustive and you're going to handle this, or you're going to have to unwrap an option or whatever. So Rust, I think at least surfaces this concern. But if you just slap a question mark after everything, then: oops, no seconds? We're done here. And for one second out of every minute, we just fail to retrieve the time.
Maybe your program's cool with that.
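The two approaches can be put side by side in a short Rust sketch. Both functions are hypothetical, and the "missing seconds means zero" rule in the second one is an assumed policy that sender and receiver would have to agree on, not something the format provides.

```rust
use std::collections::HashMap;

/// "Slap a question mark after everything": any missing field fails the whole read.
fn time_strict(fields: &HashMap<char, u32>) -> Option<(u32, u32, u32)> {
    Some((*fields.get(&'h')?, *fields.get(&'m')?, *fields.get(&'s')?))
}

/// Explicit match on the seconds field, with a default chosen by the receiver.
fn time_lenient(fields: &HashMap<char, u32>) -> Option<(u32, u32, u32)> {
    let h = *fields.get(&'h')?;
    let m = *fields.get(&'m')?;
    let s = match fields.get(&'s') {
        Some(s) => *s,
        None => 0, // assumed convention: an absent seconds field means zero
    };
    Some((h, m, s))
}
```

For a message that only carries `h = 11` and `m = 4`, `time_strict` returns `None` while `time_lenient` returns `Some((11, 4, 0))`. Neither is automatically right. The second simply makes the guess visible in the code.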
Amos Wenger: I guess this is the whole point of your presentation is that on the one hand, you can now have partial understanding. You can like some fields, you don't know how to decode, but at least you got the timestamp or something. But on the other hand, now you have an explosion of combinations of cases to deal with.
And it's true that there's- I see the parallel between languages like Rust that force you to deal with like error, no error, none or some. As opposed to languages that will somewhat work up to a point, but just like propagate null or NaN or whatever, well, NaN is a problem anywhere. But that partial functionality thing... I remember people being upset coming from PHP to a compiled language because they were like, "Well, but I liked it when half my website was broken because at least the other half still worked! Now everything's either broken or everything compiles..." which is a really different way to approach things.
James Munns: Yeah. I have an extremely strongly typed programming language brain when it comes to these things. So I think you can definitely pick up on my biases when I'm talking about- I mean, I'm kind of picking on JSON here because all of these things that I'm mentioning are really just like a tiny version of the kind of problems you can run into with JSON and things like that.
But... JSON is by far not the only offender here. And as you mentioned, different languages and different protocols run into this. Like the "zero values always mean something" thing is directly picking on ProtoBuf, which is another one that I'm going to throw some stones at.
But the point is, it's not necessarily bad. Like, don't get me wrong. It is not a purely good or bad decision here. It is just: you should know what costs you're signing up for when you say, "Ah, I will choose this because it gives me this flexibility," and what that flexibility really costs.
Self describing formats - key:value stores
James Munns: And the thing is that when you switch to these self describing formats, because they have all this flexibility, essentially every single self describing format is a key value store.
At least the common ones that I'm aware of today essentially all boil down to a very hash map or dictionary or key value store sort of interface, in that you're not getting information, you're getting a mini database that you have to query and deal with what happens if your queries fail.
Even binary formats like ProtoBuf. It's still a key value store. The keys are integers instead of strings, but it is still a key value store, which is how you're allowed to have bonus fields and fields that you move around and things like that.
Amos Wenger: I think the main exception here would be columnar formats? Like Arrow, Parquet, and there's a new one I keep forgetting the name of... where they actually, they don't describe each record, they describe the entire set. And then there's like: here's all the timestamps for all the records, and then here's all the names and all the descriptions.
James Munns: That's fair...
Amos Wenger: Those might be different, but they also have different use cases and I think neither you nor I have use cases for them. So, yes, in our little corner of the universe, I agree.
James Munns: Yeah, my big data is a couple megabytes instead of, you know, Apache Arrow.
Like I said, this is not just throwing stones at JSON or ProtoBuf. I mean, it's JSON, TOML, YAML, ProtoBuf, CBOR, ASN.1. Like, it doesn't matter whether it's a binary format or an ASCII format or a UTF-8 format or if it's human readable or if it's not.
If it is a self describing format, you've chosen that for the flexibility and that flexibility comes with all of these costs regardless of how fast it is to serialize or deserialize. You are serializing a view of an object basically and when you deserialize that you have to deal with the externalities of that.
Amos Wenger: Can't help but notice you misspelled "MessagePack" in there, you wrote it "CBOR," I'm sure that's a mistake.
James Munns: MessagePack... I have to remember, I'm not sure if MessagePack is self describing. CBOR is like the binary version of JSON and it has all that same flexibility that JSON has, just in a much smaller form.
Amos Wenger: I think the main difference is that CBOR has like an actual RFC or something? But it's otherwise very, very close to MessagePack.
James Munns: Gotcha. Yeah, I know Cap'n Proto and MessagePack. I always forget the details between them. I think Cap'n Proto is not self describing, but... I included the ones that I knew were.
Amos Wenger: I forget what it does. All I remember for Cap'n Proto is that they have a picture of like the captain on a cereal box and then they're like, "Infinite speed! Because you don't have to decode anything..." and I'm like, "That's not really... but sure..."
James Munns: It's the same trick that rkyv pulls.
Amos Wenger: Yeah yeah. And abomonation and a bunch of other zero copy things.
James Munns: Yeah, abomonation's extra special. That's a whole extra thing, but- we're sticking with self describing formats when I'm throwing stones. So we're talking about these formats: it's all of them.
They allow failable queries and bonus data
James Munns: And for better or worse, this is what we like about them.
We want to have config files where we can leave out the ones that are default. And we just admit that we have to define what default values of all of these are and... we have to admit that there's certain fields that we care about and certain fields we don't, and we just deal with that because it makes the user experience of one frame, whether that's a config file on the disk or a message we receive back from GitHub's API, what we care about.
For better or worse, they all allow failable queries and bonus data and missing data and things like that which means
Did this switch actually help us?
And the answer I would probably say is maybe. It helped us if we realized what we were signing up for. But if we just switched our code over from saying, " Grab integer, grab integer, grab integer," to, "Parse message, grab H M S," are we better off? Well, we could handle some cases better, like with bonus data, but if something's missing or the wrong type, we're back at square one. We just go: I have no idea what this message means. I failed to obtain this timestamp.
"But you could just..." and if we have the discipline
James Munns: And there's going to be a lot of people who are going to be like, "Well, you could just-" and I'm sure the Go people would say, "Well, you just trust zero as a magic value. And you would just do this." It just gets into "No true Scotsman" of: yes. If you were just git gud enough and you handled all these edge cases, you would be well sorted. And I think that's a reasonable position to take. But I think that most people don't necessarily take all of these things.
If people were already such pros that they have the discipline to keep messages semver compatible or whatever you want to call it, where we only make non breaking, forwards compatible changes.
If you already have that level of discipline,
then couldn't we just admit that we have more than one kind of messages and just admit that this isn't the same message as the one that we were sending before and put that on a different API or a different version ID of this or something like that. Where we just say, "There are two different messages," instead of just saying, "Well, there's only one kind: JSON." But then the semantical shape of the message that I'm sending becomes load bearing versus just: oh, I'm sending you JSON versus something else.
Amos Wenger: I see what you're doing here: you're just saying all that because Postcard is not self describing. I see what you're selling.
Self describing formats have benefits, but costs are nuanced
James Munns: It's certainly what I'm doing. Yeah, yeah.
And I don't want to get it wrong: self describing formats have benefits,
but it's important for me to realize the costs are a bit more nuanced. And yeah, this does exactly come from Postcard because Postcard's non self describing and furthermore, it's very easy if you were to send a message of one kind and to receive it in another way, there's a good chance that it would succeed in deserialization, but it would be wrong. Like, if you swap two fields, and I was just going integer integer, and you read those and those are both integers, I might say: cool, the minutes are 57, and the seconds are 4, when it's really the other way around.
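A tiny Rust sketch of that silent mix-up. The two-byte layout here is a made-up positional encoding for illustration and is not claimed to be Postcard's actual wire format. The sender writes minutes then seconds, while the receiver was written to expect seconds then minutes.

```rust
/// Sender's view: minutes first, then seconds. Hypothetical two-byte layout.
fn encode_min_sec(minutes: u8, seconds: u8) -> [u8; 2] {
    [minutes, seconds]
}

/// Receiver's (mismatched) view: it believes seconds come first.
/// Returns (minutes, seconds) as the receiver understands them.
fn decode_sec_min(bytes: [u8; 2]) -> (u8, u8) {
    let seconds = bytes[0];
    let minutes = bytes[1];
    (minutes, seconds)
}
```

Encoding 4 minutes and 57 seconds and decoding it on the mismatched side gives `(57, 4)`. Nothing fails and no error is raised. Both values are valid integers in range, so the receiver has no way to notice.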
And this post is exactly me trying to work that out, because at some point I went: you know what? Maybe as a non self describing format, that isn't what I want. Maybe I've tried to push performance too far, and the user experience is just not good. And I was actually trying to figure out how I could make a self describing format version of Postcard and what that would get you, and...
It was then that I realized that the amount of changes you'd have to write in your code to handle all of these edge cases was a lot. Where instead of just saying: did I get the message or not? Did it decode or not? I have to start handling failable queries and my accessors get much larger in terms of code and cost and things like that, because I'm not running linearly through the encoding and decoding. I am storing it in some kind of collection and querying it.
Self identifying format
James Munns: A lot of this ended up to... I don't know if it's an invention or just something that I didn't know about before, but actually sort of a middle ground. So not a non self describing format or a self describing format. But what I've sort of been calling a self identifying format.
So you don't necessarily have all of those field items in there, or you don't have the flexibility to skip fields or remove fields or things like that. But instead, you at least have a unique tag on a message so that our two senders can cross check with each other of going, "Are you sending me an hour hour, minute minute, second second?"
And you can say, "Yes, I am looking for an hour hour, minute minute, second second." And the interesting part that I'm doing is instead making the self describing schema part of these things a side channel. Where if you and I know that we agree with each other, we just put that unique tag in every message so that we can cross check.
But if all of a sudden I send you something that you don't understand, you can go, "Wait a minute, you just sent me tag FAB. What is the schema for FAB? I don't know what you're talking about." And maybe you could have some slower failure path where you do fall back to a more self describing format where you use the schema and the blob of message and get back something that looks like a serde_json Value that you can query and maybe recover from. Or you just at least know up front: I'm not going to misunderstand this message. I will just immediately reject it and know that it's not what I'm looking for.
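The "reject it up front" half of that idea can be sketched in Rust as follows. The one-byte tag, its value `0xAB`, and the frame layout are all assumptions invented for this example. The body stays positional and is only decoded once the tag matches what the receiver expects.

```rust
/// Hypothetical tag identifying the "hour, minute, second" message shape.
const HMS_TAG: u8 = 0xAB;

#[derive(Debug, PartialEq)]
enum DecodeError {
    Empty,
    UnknownTag(u8),
    BadLength,
}

/// Frame layout assumed here: [tag, hour, minute, second].
fn decode_hms(frame: &[u8]) -> Result<(u8, u8, u8), DecodeError> {
    let (tag, body) = frame.split_first().ok_or(DecodeError::Empty)?;
    if *tag != HMS_TAG {
        // We do not understand this message, and we know that we do not.
        return Err(DecodeError::UnknownTag(*tag));
    }
    match body {
        [h, m, s] => Ok((*h, *m, *s)),
        _ => Err(DecodeError::BadLength),
    }
}
```

A frame `[0xAB, 11, 4, 27]` decodes to `(11, 4, 27)`, while a frame starting with any other tag is rejected with `UnknownTag` instead of being misread. A slower fallback path could take the `UnknownTag` case and go ask the sender for the schema.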
Research arc from that post from a year ago
James Munns: That's sort of the whole research arc that I went through and when I wrote this post a year ago, I didn't know where I would end up. I went back and looked at it and it's very funny to look at all these conclusions that I drew and went, "Well, I don't know what you do in that case," and realizing sort of the: what checks the boxes for what problems people with Postcard actually have, which is detecting when things change, or realizing whether you need to use a different format or at compile time, even having a CI check that says, "Did I accidentally change the message? Is this going to break all of my devices in the field?"
Amos Wenger: A la cargo-semver-checks, yep.
James Munns: Yeah, exactly. And that was sort of the middle ground that I reached. I don't know if I want to pay for self describing all of the time, but I want people to have the option to detect it, because they should be, and then maybe a slow failure path... because you might have sort of asymmetrical systems where our tiny embedded device, we don't want to burden it with the ability to send 15 different kinds of messages, but our desktop server doesn't care if it's got 57 flavors of decoding. That's trivial for the application that's running on your desktop.
That allows your microcontroller to fly while still your desktop is flying, but that's just because it's a thousand times bigger and faster and you can afford the checks in one place, but not the other.
Conclusion: I have something that works
Amos Wenger: So your conclusion is that you don't have a conclusion yet. It's an open research problem. And this is mostly... you've been thinking about all these things. Am I wrong?
James Munns: It's, it's, it's, it's... I have something that works.
Amos Wenger: Oh, you do?
James Munns: So I've talked about Postcard-RPC in the past. This is the trick that Postcard-RPC pulls: to generate those unique tags for each kind of message I hash the schema and use that as a small tag in every message and I have the ability to serialize my schemas and so the system that I'm building now on top of Postcard-RPC allows you to say, "Please give me all of your schemas," And I can get the full, like, OpenAPI description type thing from the device and that allows my server to handle messages it doesn't understand.
You're still limited in what you can do with that. You're either querying specific things or you just store it, or you forward it. Where I still can't tell the difference of the key "temp" versus "temperature" but if I'm just sticking the message in the database to pass it on to someone else later down the road that's probably fine because they'll know.
As an intermediary, I don't necessarily have to understand everything that transits through me. So I have something that works.
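To give a feel for "hash the schema and use that as a tag", here is a minimal Rust sketch using the FNV-1a 64-bit hash over a textual schema description. The choice of hash, the schema strings, and the `schema_tag` name are assumptions for illustration. This is not claimed to be how Postcard-RPC derives its tags.

```rust
/// FNV-1a, 64-bit variant, over raw bytes.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Derives a message tag from a textual schema description (hypothetical format).
fn schema_tag(schema: &str) -> u64 {
    fnv1a_64(schema.as_bytes())
}
```

The same schema string always yields the same tag, and changing a field type, for example from "h:u8,m:u8,s:u8" to "h:u8,m:u8,s:i8", yields a different one. That lets both sides notice a changed message shape without shipping the whole schema in every message.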
Amos Wenger: Right. But you could do that even if you didn't have the schema, right? You could just like pass on the bytes unchanged, but what the schema would let you do is, for example, log it in a structured way. We got that message. We don't know what it's about, but we know it has these fields. This one is text. So maybe if a human looks at it, they can tell what's going on.
James Munns: Yeah, either dumping it to logs is great because you don't have to understand things just to convert them to strings. You can validate that messages are still good. Like: Oh, is this a poorly formed message? I'm not going to proxy this message because I know it's poorly formed and I'm not going to waste the embedded device's time with a bad message.
Or you can transpile or transcode the message to JSON, because if I have the schema I can actually convert this non-self-describing message into a self-describing message. If someone's getting this message from Python, maybe they would like JSON more than they would like Postcard. And so those are all sort of the capabilities.
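As a rough sketch of the validate-and-transcode idea: given a schema, an intermediary can walk raw bytes, reject malformed input, and emit JSON without knowing what the fields mean. The `Schema` enum and the byte layout below (single-byte integers, one-byte string lengths) are simplifications invented for this example and are not Postcard's real schema types or wire format.

```rust
/// A tiny, hypothetical schema language.
enum Schema {
    U8,
    Bool,
    /// One length byte followed by that many UTF-8 bytes (a simplification).
    Str,
    /// Named fields, encoded back to back in declaration order.
    Struct(Vec<(&'static str, Schema)>),
}

/// Quote a string for JSON, escaping quotes, backslashes, and control characters.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Decode one value from the front of `input`. Returns its JSON text and the
/// remaining bytes, or `None` if the bytes do not match the schema.
fn to_json<'a>(schema: &Schema, input: &'a [u8]) -> Option<(String, &'a [u8])> {
    match schema {
        Schema::U8 => {
            let (&b, rest) = input.split_first()?;
            Some((b.to_string(), rest))
        }
        Schema::Bool => {
            let (&b, rest) = input.split_first()?;
            match b {
                0 => Some(("false".to_string(), rest)),
                1 => Some(("true".to_string(), rest)),
                _ => None, // any other byte is a malformed bool
            }
        }
        Schema::Str => {
            let (&len, rest) = input.split_first()?;
            let len = len as usize;
            if rest.len() < len {
                return None; // truncated string
            }
            let (text, rest) = rest.split_at(len);
            let text = std::str::from_utf8(text).ok()?;
            Some((json_string(text), rest))
        }
        Schema::Struct(fields) => {
            let mut rest = input;
            let mut parts = Vec::new();
            for (name, field_schema) in fields {
                let (value, remaining) = to_json(field_schema, rest)?;
                parts.push(format!("{}:{}", json_string(name), value));
                rest = remaining;
            }
            Some((format!("{{{}}}", parts.join(",")), rest))
        }
    }
}

/// Validate a whole message and transcode it; trailing bytes count as malformed.
fn transcode(schema: &Schema, input: &[u8]) -> Option<String> {
    let (json, rest) = to_json(schema, input)?;
    if rest.is_empty() { Some(json) } else { None }
}
```

The same walk serves all three uses mentioned: a `None` result means "do not proxy this," while the JSON text can go to a log or to a client that prefers a self-describing format.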
I still need the follow through
James Munns: I need to do the follow through still. So I have all of this and it does work and it's very neat and I'm very excited about it. We'll see if it ends up having edge cases that I just haven't run into yet, but I think it's an interesting middle ground. Not all the way in one camp and not in the other, but still checking the boxes we want to check while also not signing us up for "every message can be queryable," because the message is still the message. You just know whether you have a good one or a bad one at this point.
Amos Wenger: I think it's just another case or a similar concept to denormalization. No, normalization. I forget which direction goes where. So the basic idea is that: the thing I have in mind is JSON API, because a bunch of APIs will return, I don't know, a list of articles, and then there's your user field. And then they're all articles from the same user.
So they just duplicate the information about that user for every article. They have 10 articles, they have 10 copies of the user object. And then what JSON API does is it says: okay, that field is of type user. So actually, we're going to give you an ID. It doesn't need to be globally unique or anything.
It just needs to be unique to the document they're sending you. So you're going to have all the articles and then separately, you're going to have an array, or I guess a map of users, from that ID to the actual user data. And then it's very, very annoying to deserialize that or decode or like destructure it, especially in Rust, which is very strongly typed, but in other languages, it's fine.
You just cast everything to an object. That's, uh, that's how it works. That's, that's the thing that the Patreon API... RIP, it's not, not been maintained for a while. That's what they used. It makes me think of the same thing. I have the parallel in my head because in JSON, yes, every object in an array where every object is the same type, they all describe themselves exactly the same way.
And it only is useful if, I guess, some fields are missing in some of them. But I would much rather have one description at the beginning of the array, and then just all the data at once. Which is not even what you're doing. You're doing a third option yet.
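The deduplication Amos describes, where each article carries a user ID and the user data appears once in a separate map, might look like this in outline. The types and the `normalize` function are hypothetical and only illustrate the shape of the idea; they do not follow the JSON API specification's actual document layout.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct User {
    id: u32,
    name: String,
}

/// Denormalized form: every article embeds a full copy of its author.
struct Article {
    title: String,
    author: User,
}

/// Normalized form: the article only refers to its author by ID.
#[derive(Debug, PartialEq)]
struct ArticleRef {
    title: String,
    author_id: u32,
}

/// Split embedded authors out into a single map keyed by user ID.
fn normalize(articles: &[Article]) -> (Vec<ArticleRef>, BTreeMap<u32, User>) {
    let mut users = BTreeMap::new();
    let mut refs = Vec::new();
    for article in articles {
        // Keep the first copy of each user; later duplicates are dropped.
        users
            .entry(article.author.id)
            .or_insert_with(|| article.author.clone());
        refs.push(ArticleRef {
            title: article.title.clone(),
            author_id: article.author.id,
        });
    }
    (refs, users)
}
```

Ten articles by one author then carry one copy of the user instead of ten, at the price of a lookup step when reading them back.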
James Munns: Yeah, but I think there's a lot of value to having these schemas because you can start doing transforms and things like that. One of the things that I also in the future want to research is the ability to have automatic migrations or whatever you want to call it. Where if someone sends you a message of a different format that's missing a field, and I know the embedded device expects that field, and I've got some metadata of going: well, we need to make sure that we insert this extra field, at least with like a none in it.
So it might just be: we don't have the data, but I know that I'm always allowed to upgrade a message if the only difference between these schemas is whether this has an option field or not. And I know that option is defaultable or nullable or whatever you want to call it, and I can just actually transcode your binary message to a different binary message at the cost of having to decode and reencode it. But like I said, on my proxy that might not be a significant cost, and it allows me to then not have to upgrade all my firmware devices to speak some new protocol or update all of my clients to change.
You get sort of this interesting abstract transform that you're able to do. But this is a little beside the self identifying point and just the value of having schemas and being able to have like a reflection. So by building all of this, just to have the schemas so that I could hash them, I essentially had to invent reflection for Postcard messages so that you could get the schemas.
And now that I have those, I've started figuring out all the interesting things you can do with reflection and thinking of messages as transcodable formats instead of just saying: the message is open ended and can change at any time. I have essentially reintroduced strong types into my wire format.
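The "upgrade by inserting a none" migration James mentions could be sketched like this. Everything here is an assumption for illustration: the `Kind` and `Field` types, the `upgrade` function, and the restriction to one-byte fields. The option encoding used, a `0x00` byte for none and `0x01` followed by the value for some, is meant to mirror how Postcard encodes options, but this is not code from Postcard or Postcard-RPC.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    U8,       // one byte
    Bool,     // one byte
    OptionU8, // 0x00 for none, or 0x01 followed by one value byte
}

struct Field {
    name: &'static str,
    kind: Kind,
}

/// How many bytes the next value of this kind occupies at the front of `input`.
fn encoded_len(kind: Kind, input: &[u8]) -> Option<usize> {
    match kind {
        Kind::U8 | Kind::Bool => Some(1),
        Kind::OptionU8 => match *input.first()? {
            0 => Some(1),
            1 => Some(2),
            _ => None, // malformed option tag
        },
    }
}

/// Re-encode a message written with the `old` schema so it matches `new`.
/// Fields present in both are copied through; a field that exists only in
/// `new` is filled with "none" if it is optional. Anything else fails.
fn upgrade(old: &[Field], new: &[Field], mut input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut next_old = 0; // index of the next unconsumed field in `old`
    for field in new {
        let matches_old = old
            .get(next_old)
            .map_or(false, |o| o.name == field.name && o.kind == field.kind);
        if matches_old {
            let len = encoded_len(field.kind, input)?;
            if input.len() < len {
                return None; // truncated message
            }
            out.extend_from_slice(&input[..len]);
            input = &input[len..];
            next_old += 1;
        } else if field.kind == Kind::OptionU8 {
            out.push(0x00); // the sender did not have this field: insert none
        } else {
            return None; // a required field cannot be invented
        }
    }
    // Leftover old fields or bytes mean the schemas differ in some other way.
    if next_old != old.len() || !input.is_empty() {
        return None;
    }
    Some(out)
}
```

The decode-and-reencode cost lands on the proxy, which is the asymmetry discussed earlier: the desktop side pays for flexibility so the firmware does not have to.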
Amos Wenger: I'm very excited to actually play with that. Because, yeah... we've gone from XML: which had some of these things baked in, it has schemas and everything, to JSON: which is anything goes, fully self describing, to: okay, maybe we need schemas for JSON too, but now they're out of band, to like: oh, we need a way to discover API endpoints, so let's make a standard for that.
And then that's also just a bunch of JSONs. We also need schema for those and it's all meta. Everything is JSON. It's all a big soup and it doesn't feel... I don't know. It doesn't feel good.
James Munns: For sure. It's still an area of active research, but I'm gonna be launching something pretty cool with it soon, so I'm sure I'll talk about it more then.
Episode Sponsor
Descript is a tool we've been using to do most of the production of each episode. Editing audio and video has been a breeze simply by editing the transcribed text, like in a document. Inserting slides exactly where I want them is super easy by dragging and dropping images or videos, and creating templates and layouts, as well as cutting or combining compositions, makes the production of each episode smooth and simple.
There are many more features to explore, so check it out for yourself for free by clicking here. And if you decide to upgrade to a paid plan, a portion of the purchase will support this podcast.