How is software safety certified?


(a crash course)

James explains a bit about how safety critical industries think about reliability, the work that goes into shipping safety critical software, and how the Rust language and compiler are a good fit for these industries.


Show Notes

Episode Sponsor: Depot

Transcript

Amos Wenger: So James, what do you have for us today?

How is software safety certified?

James Munns: So like every one of my actual good blog posts, I had someone ask exactly the right question that tickled my brain in a way that made all the information fall out in like a coherent way. So instead of just like an unordered web of random facts that I know, I was like, "Ah, I can imagine the person that asked me this question and how I should answer this question for that person." And the question they asked me was, "Hey, you do stuff with safety critical stuff, right? How is software safety certified?" Because like I've heard people talk about that, but I have no idea what that actually means.

Amos Wenger: I love that the subtitle is "A Crash Course" because... Is that a pun? Is the pun intended?

James Munns: Yeah, exactly. Yeah, "Crash Course" is even probably a little bold, but it's just like common misnomers, like all the stuff that I either hear people say and be wrong or like that I always get asked and I'm like, "Okay, I'm going to actually..." So this is probably going to turn into a blog post as well with like a more written out request, but I figured you're the kind of person that might be interested in this.

Amos Wenger: This is what SDR is for, right? It's like pitching blog posts to the co-host and seeing if they would actually work.

James Munns: Yeah, because it's faster to write bullet points in a slide than it is to write like a well-reasoned defendable blog post because people on the internet. That's what I've been doing. Yeah, exactly.

Amos Wenger: Some things I pitched to you, I see the reaction, I'm like, "Oh no, I'm not dealing with YouTube comments on that. You're right. I was wrong. Never mind."

James Munns: Nice. Okay, so I get asked about this pretty often. I come from a background of safety critical and especially nowadays that safety critical Rust is a thing that you can be doing today after a lot of hard work and effort and time. More and more folks, especially in like the Rust scene that I'm hanging out in are like, "Hey, what does safety critical actually mean?"

Amos Wenger: Just to clarify, you mentioned a lot of hard work. You're talking about hard work that's already been done by other people and are unlocking these use cases now, not hard work that you need to do in order to get anything done in that area, right?

James Munns: Correct, yeah, that work. The work has been done over years. We'll talk about this a bit more later, but safety critical is an industry that moves slow. Getting them to adopt some new tech, whether it's a language or anything, takes time because they're conservative. I'd probably argue reasonably so, but they're also open to like, "Hey, how can we do this better?" That process takes a while. For a new language that could be used for automotive or avionics and stuff like that, that's something that hasn't happened in literal decades. C and probably Ada have been the go-to languages for safety critical for a long time, with a little bit of C++ in there, depending on your industry. The time since a new language was introduced in safety critical processes is probably 20 years, not exaggerating, probably like mid to late 90s kind of stuff.

Amos Wenger: I see. I mean, that seems like a long time, but I also realized recently that I was 17 years old 17 years ago.

James Munns: Okay.

Amos Wenger: Because I was looking for music from my teenage years. So my point is, if you're saying that a new language hasn't been adopted in 20 years, I was probably already coding back then-- oh, maybe not, maybe not. Maybe I'm just shy of that, but it's a long time, and I'm not getting any younger. Even though everyone says I'm a Zoomer in the YouTube comments, I'm not. I'm a millennial and I'm mildly neutral about it.

James Munns: Time marches forward. And these days, you can already get three compilers. There's not just one safety critical Rust compiler, there are three vendors offering a safety critical compiler. So you'll see this on product pages. There's Ferrocene from the Ferrous Systems folks. AdaCore is adding Rust support to their existing Ada...

Amos Wenger: I don't know. I have no idea.

James Munns: I always forget.

Amos Wenger: You're the authority on this as far as I'm concerned.

James Munns: Yeah, yeah, should be. They're adding support for it to their toolchain. HighTec is another one. They make compilers for a specific flavor of very popular chips that are used in automotive. And I say very popular: they're very popular in automotive and nowhere else in the world. So if you haven't worked in automotive, you probably have never used the TriCore processor from Infineon.

Amos Wenger: I sure have not, but I'm not the target audience for this. But yeah, when we had a chip shortage, cars were affected. But now you're telling me that in cars we use things that nobody else uses. So I don't know, same causes, different supply chains, or how does it work?

James Munns: It can't. Yeah. I mean, a lot of stuff goes into automotive. So that's just like the CPU that's running code, but you need a lot of other stuff. You need like voltage regulators for power supplies or analog circuitry for comparing voltages and stuff. So I'm not sure if these chips were ever in shortage because they're really only used in these niches. But there's a lot of periphery stuff that goes into laptops and phones and stuff like that, that is more common with automotive stuff. Yeah.

I used to work in safety critical. I started my career in avionics. I did gas detection for a while. I've done some industrial like robotics-y kind of stuff. And then when I was at Ferrous, I looked into a lot of other safety standards because we were starting Ferrocene at the time. So like I've done quite a bit of this, not as much recently because I've been more on like connected devices and stuff. So my knowledge might be a little out of date. But like I said, these industries don't move super fast. And we're only going to hit the high notes here on this. This is probably not going to give you enough to ship your first product, but enough to get like the vague idea of how it works.

Amos Wenger: I'm just curious, did Ferrocene start because you had that specific skill set and background or did you join first because you had the background they were looking for?

James Munns: Well, I started Ferrous with Felix and Florian and we were getting embedded stuff going like at the beginning of Ferrous. And at the time I went--

Amos Wenger: I didn't realize you were a founder. That's embarrassing.

James Munns: Yeah. Yeah. I mean, I was there for three years. I've been gone as long as I was there at this point.

Amos Wenger: So I just assume everyone's been there at some point, except me. It's a rite of passage. It's a shibboleth that I don't have.

James Munns: Yeah. I know that this industry moves slowly and someday I want to get back into these industries and things like that.

Amos Wenger: Well, there's no rush, literally. Yeah. It'll still be there when you're ready.

James Munns: I joke about it. Sometimes you have to start five year fights and like sometimes there are fights. You just have to keep working on it for five years before it's a thing. And so I was like, well, if it's going to be a five year fight, we better start today.

Amos Wenger: I just started one of these. I've been working on facet. This is not a facet episode, but--

James Munns: I'm excited for the facet episode.

Amos Wenger: So am I. And we need something from the compiler. So much going on. The compiler team reached out. They were like, "Hey, is there anything you need?" And I made a little shopping list. Like, "Oh, that'd be nice. Like I need that in const and whatnot." And then I asked, "Hey, do you have any sort of ETA for that?" And they're like, "Looking at the track record for this... maybe five years." So, you know, stay tuned for season three of Self-Directed Research in 2030.

James Munns: But I mean, it's what-- you got to start. And it's one of those, if you want the industry to move, someone had to push it. And so I started just yelling about it. And I talked to people for like two years and it didn't feel like there was a lot of progress. And then right as I was actually leaving Ferrous is when we started really getting some traction and people started to be interested, especially in automotive. And so all the work of actually following through with this was not me. I was part of planning the initial plan, and talking to people and selling them on the idea. In a lot of industries, you just need to not surprise people. So you kind of have to talk about stuff for years before they go, "Oh yeah, I've heard of someone talking about Rust for safety critical. That makes sense." So they don't have that immediate rejection. Like, "Ah! That sounds silly. I've never heard of that before."

Amos Wenger: That's cool, right? Seeing your work pay off, even if you're not there... 'there' there, because I've left companies and then things kind of fell through, like all the things that I tried to start, and that doesn't feel good. So I think I would be happy to see that all my work paid off even if I'm not there anymore.

James Munns: Very much so. Yeah. Like I said, it's what I wanted to happen even if I wasn't the one who was going to do it. But at the time I couldn't think of anyone but me who was going to do it. So, you know, you got to pick some fights.

There is no "safe" and "not safe" binary

Amos Wenger: So James, I can see in your notes here that safety is like me. It's non-binary.

James Munns: Yes, exactly. So this is another one of those things where people are like, "Oh, why don't we just have safe software? There's just safe software and there's not safe software." And that's one of those things in this process where there really isn't a binary. If you were thinking about something like a bolt that goes into a bridge, what does it mean to have a safe bolt? Does it hold the capacity that it's rated for?

Amos Wenger: Yeah.

James Munns: Does it resist corrosion if it's going to be outside for 40 years or something like that?

Amos Wenger: Yeah, you have to think about the whole system and all the constraints. And yeah.

James Munns: So I mean, like a component can do what it says it does, but it's not like in and of itself safe or not. Safety is a property of like the whole thing. Like the bridge is safe because we picked bolts that are corrosion resistant and, you know, girders and concrete and whatever. And we set up a schedule for maintenance. Like that is safe.

Amos Wenger: I'm so glad you picked bridges because nobody gets mad when we compare bridge engineering and software development.

James Munns: And that's a whole other thing: software is just orders of magnitude more complex and newer and less studied...

Amos Wenger: Than bridges?

James Munns: Than bridges, than a lot of mechanical, electrical, chemical engineering. We haven't figured it out as much as we have other fields of engineering. And I think we will over time, but also software is just trickier.

Amos Wenger: So it's not that we suck at it. It's just harder. Yeah, I see. Yeah, that makes me feel good somehow. It's a good podcast. I like it.

James Munns: And the other thing is there's no one-size-fits-all approach. Safe means different things in different contexts. In an airplane, there might be different levels of like, hey, this is what keeps the airplane in the air, versus, hey, it would be really annoying and distracting if this failed, but it's not going to ruin anyone's day kind of thing. So it really depends on what you're doing, the industry, or even the country that you're in, and how you're going to use the thing. So you can't just be like, "I wrote a safe operating system scheduler," because safety is a property of the whole system, and it gets crushed down too much into just one word, safe or not.

Amos Wenger: And I would imagine that there's a cost aspect to all this that we probably hate to think about. But let's say you're one of several companies offering something, you have to win the contract, right? I would imagine. So if you go too far and make everything 3x what it needs to be, then it's too expensive. And then your thing never even gets on the market. Is that realistic at all?

James Munns: Exactly. For sure. Like there's material engineering that goes into bolts and stuff like that. And then like, how large or heavy they are is important and stuff like that. So you could have big chunky, you know, 30 centimeter thick bolts for everything, but then the bridge won't work because it's too heavy. And the same way, like with software, you could spend five years shipping a component. But if the people don't care, like if you're not doing it for that level of criticality, it's a lot of extra work and formalism. Like you could be doing 20 other things.

Amos Wenger: You could write a web app in Rust. Yeah.

James Munns: Yeah, exactly.

Amos Wenger: That's what I spent five years doing. It's working.

James Munns: Hey, it's research. With research, you know, there's no, yeah, no whatever. This is more... integration is not the right word... practical application, applied research kind of stuff.

Amos Wenger: It's just, if you don't get money out of it, just start a podcast about it. That's not true. I do get money out of it.

James Munns: There's a whole area of research that I think is interesting, which is research in the open for content, which is basically what we're doing. But that's a story for another day.

Amos Wenger: I mean, it could feel better. "Doing it for content" is not a sentence I like to hear, but we're not like doing pranks or something. We're just... well, okay. Some of my Rust projects, maybe, I don't know. Well, merde is dead.

James Munns: Yeah. I mean, there's something to be said about like, you're always doing research for someone. Doing research with the goal of making it understandable and entertaining is one way... it's an alternative to writing grants. You know what I mean? Like the alternative is convincing a company that it's in their R and D interest to do that. Whereas like, if someone's learning metallurgy and has a YouTube channel about it and they're doing research and they're publishing the results and things like that, but the way that they get paid is not by a company paying them to do that research, but just because the research that they're presenting is entertaining enough for people to follow, like the folks that listen to this podcast, then it's interesting to me. I don't know if it would apply to all fields, but I do think it is like, you know, that or folks who like run museums and stuff like that. And they appeal to a wider audience by putting stuff online where they can increase their audience and get funding for like a regional museum that would never have that many people's eyes on whatever their museum is building. You know what I mean?

Amos Wenger: I think you're right on as to the benefits we get from this work being public, but for me, I think it's also a way for the both of us to find ourselves in a call and discuss things. And I think facet has been shaped by season one of SDR when we talked about serializers and I was like, how much can we do with declarative macros? And then I was like, wait a minute, we don't need declarative macros. And we were like, what about stack based stuff? And now I'm like in Facet, I want every serializer and deserializer to be iterative and not recursive. Because why blow up the stack if you don't need it? You can make your own stack. It's fine. We need to focus. Back to the topic.

Functional safety

James Munns: Ok. So, the term that you're going to hear in a bunch of this, it's not the like concrete definition, but you'll hear the term functional safety. And most industries, their approach to safety critical is based around the concept of functional safety. And functional safety is too broad for me to go into a definition or where it comes from or anything like that. But I'm going to pull out the parts that I think are most core to the whole concept, because we only have so many minutes here.

Amos Wenger: Is it basically checklists, like surgeons have checklists?

James Munns: We'll get there.

Amos Wenger: Ok.

James Munns: So the first principle is that failure is statistical. Everything can fail; there is no such thing as a perfect component. It's all at some level of, within 1 million hours, this many percent will fail, or things like that. All failure is statistical, for any single component, whether we're talking about a bolt or a spring or a line of code or a resistor on a circuit board, or the humans that are reviewing all of those things. Everything can fail and will fail. There is no perfect, infallible, fails-zero-percent-of-the-time component.

Amos Wenger: What's the company that publishes numbers for hard drive failures? Backblaze. Backblaze. Yeah, it's a fun read.

James Munns: Yeah, you go, ah, well, I will have fail safes, like I'll have a fuse. Even fuses fail at some point; we design them so they fail very rarely. But even some fuses can fail in a way where they don't actually break the circuit, and so they keep conducting, or they have an arc across them or something like that.

Amos Wenger: For the people who don't know any electronics at all, right? Any electricity stuff

James Munns: Yeah. So normally you have a fuse, which usually is in most cases, like a little barrel looking thing that just has a very thin chunk of wire in it. And when you put too much power through that wire, it melts and it breaks or, you know, there's a lot of different ways you can build fuses. But the idea is that if you have some problem, like a short circuit where power is flowing in a way that it shouldn't, instead of burning the component up or even burning your house up, you have one component that fails above some limit.

Amos Wenger: Because it's designed to fail at a certain level, because the wire is a certain length, a certain diameter, and more importantly a certain material and whatnot. Is there gas in the chamber?

James Munns: It depends. So some of them are just in a glass tube so that when it melts, there's nothing there to continue completing the circuit. For the higher rated ones, they put sand in them, because one failure mode of fuses is that if they fail explosively, you can vaporize the metal, and what that does is deposit all that metal on the glass tube, and the glass tube keeps conducting electricity. So the higher rated ones will put sand in there so that when the fuse breaks, the sand fills in the gaps and it won't conduct. For the really high voltage ones, like the ones on power lines, they actually have explosives that go off. So it throws the wires apart from each other so that they can't continue arcing and things like that. So this is exactly that layer of like--

Amos Wenger: That's amazing

James Munns: depending on how important it is. You might even design a fuse in very different ways.

Amos Wenger: Right.

James Munns: The important part is that everything can fail. And we're just talking about how do we make things fail less often, because we admit that it can never be zero. So how do we say one of them fails every thousand hours? How do we take that to how does it fail every million hours or something like that? We're reducing the leading zeros in the failure rate.

Amos Wenger: And like before, you have to take the whole system into account, right? Because it's the weakest link kind of deal.

James Munns: Exactly. Because systems are made out of components and every component can fail. And no system can fail less often than how often its components fail. Like the bolt that we were talking about: we can't say the bridge lasts longer than the bolts, with some caveats if there's redundancy or things like that. But in general, you can't have a system that fails markedly less often than the components that make it up.
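To put toy numbers on that (these figures are invented for illustration, not from the episode): for independent components "in series", the system only works if every component works, so reliability can only drop as parts are added, while a redundant pair only fails when both copies fail.

```rust
// Toy reliability arithmetic with made-up numbers.
fn main() {
    // Hypothetical per-mission failure probability of a single component.
    let p_fail = 1.0e-4_f64;
    let n_components = 500;

    // Series system: everything must work, so P(works) = (1 - p)^n.
    let p_system_works = (1.0 - p_fail).powi(n_components);
    println!("series system works: {:.4}", p_system_works); // ~0.9512

    // Redundant pair: fails only if both copies fail (assuming independence).
    let p_pair_fails = p_fail * p_fail;
    println!("redundant pair fails: {:e}", p_pair_fails); // 1e-8
}
```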

Amos Wenger: Yeah. Speaking of bridges and maintaining them, I was watching the XKCD "What If?" video series that they have. And one of the what-ifs was: what if the sun went dark? And mostly people are like, well, we would all freeze and die. But here they had like nine upsides to the sun going out, like no time zones. So you can just do business. You know, everyone is on a coordinated time zone. Everyone's on UTC. And one I didn't see coming is you don't need to maintain bridges anymore, because the oceans are frozen. So you just lay down road on it and just cross.

James Munns: Everyone just gets spiked tires.

Amos Wenger: Exactly.

James Munns: And since we are talking about safety critical, because all these things that I'm saying so far are true everywhere, like you do this in all engineering. Hey, how do we make sure our washing machine lasts five years or 10 years or something like that? You've got to think about, hey, what can fail? Like what goes wrong on that? But the difference with safety critical is that failure has a very specific price. And in safety critical, those prices are deaths, injuries and financial damage in that order. And we say that deaths are more important than injuries. Injuries are more important than financial and financial is something like property damage or equipment damage or anything like that. But we really have to say like, no, really, if stuff goes wrong, people could die.

Amos Wenger: Yeah, but thanks to some lawyer, you can put a number on that as well. I forget his name.

James Munns: You can, I mean, whether you put a number on it or not, like those are the stakes that we're working with here.

Amos Wenger: I agree with the order for that matter. Yeah, yeah.

James Munns: Yeah. So because there's those tiers to those, there's different levels of diligence that you might do, where you go, look, if this piece of software is life or death for someone, we're going to put a lot more effort into it because we know how serious the stakes are, versus, like you were saying, you might be doing it for something that's just an inconvenience for someone, or you go, oh, it might burn out a motor, and that's annoying, but whatever. You might put a different level of diligence on that, whether that's what you do because you're doing your best, or what you're legally required to do before you sell that thing.

Amos Wenger: I have two things to say. I cannot stop interrupting you, James, today. I'm just it's been a while. I've missed you.

James Munns: I've missed you too. It's not a monologue. It's a dialogue.

Amos Wenger: It is. So the first thing I wanted to say is that one of my favorite podcasts is "Well, There's Your Problem." If you're interested in engineering disasters and very, very long discussions with slides... we stole that from them, so thanks for the idea. You should listen to the "Well, There's Your Problem" podcast. WTYP, whatever. You get it. And the second thing I wanted to say is: is it true that planes used to have more engines because we didn't know how to make them more reliable, but now they only have two because even if one fails, you can fly with one? Where did I hear that?

James Munns: Yes, I might've mentioned it, but I mean, there's a standard called ETOPS, which I can't even remember what the actual acronym is. It's like enhanced something, something, but the joking name is engines turn or passengers swim. Because this is the standard that you have to follow. Because when you're flying like over the ocean, where if something goes wrong, you can't turn around and quickly land at the closest airport. You're like three, four, five hours. And ETOPS is usually rated by how far you are from the nearest airport. And like you're saying, back in the day, we used to say for ETOPS type stuff, or for long range flights like that, you wanted at least three engines or ideally four so that if one or two of them failed, you could keep going into your destination.

But now we're at the point where both engines have gotten more reliable and engines have gotten more powerful. So your typical two engine plane can actually take off with just one engine. And so we've said that they are now high enough reliability that the chances of two of them failing at the same time is low enough that we can now have planes that are allowed to go over long distances that only have two engines. But yeah, it's exactly this kind of like, we figured out how to do it better and the failure rate is low enough where it's now an acceptable risk for the gain that it is not having four engines.

Amos Wenger: Taking off with just one engine is breaking my brain, because I'm still in game dev mode from when I was a little younger. And I like imagining one point where you apply force and the plane going, weeee, spinning around out of control. Does that mean that they choose the altitude of the plane? Like what if there's no airport within one hour flying? Or if you go over the desert or something?

James Munns: What do you do? If you are flying that route, then you have to be in an aircraft that is rated for that. To land in the desert? Okay: if you're flying far enough from an airport, you have to have an aircraft that is rated to be able to handle one engine out with no problem. If both your engines are out...

Amos Wenger: Yeah, you need to crash land. Yeah, yeah.

James Munns: Oh, well, you're done now. You will just land as best as you can. There's no real rating for that. The point that we're at is that, statistically, the chances of both engines going out are so low, like one in a million, one in a billion, that it's low enough to be palatable. Because this is that statistical thing: we admit no failure rate can go to zero. We just make sure that 999,999 out of a million times it doesn't happen.
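A back-of-the-envelope version of that "both engines out" reasoning, with invented numbers (real ETOPS analysis is far more involved, and also has to account for common-cause failures like fuel contamination or bird strikes, which break the independence assumption below):

```rust
// Sketch with made-up rates: independent engine failures multiply.
fn main() {
    let rate_per_hour = 1.0 / 100_000.0; // hypothetical in-flight shutdown rate
    let flight_hours = 6.0;              // a long oceanic leg

    let p_one_engine = rate_per_hour * flight_hours;  // ~6e-5
    let p_both_engines = p_one_engine * p_one_engine; // ~3.6e-9

    println!("P(one engine fails this flight)  ~ {:e}", p_one_engine);
    println!("P(both engines fail this flight) ~ {:e}", p_both_engines);
}
```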

Amos Wenger: I'm reminded of that time where someone close to me had a medical emergency on a plane, and the crew were just flipping through pages like, okay, it's not that, it's not that. And they said they were going to call for a doctor on board, and I was like, ah, so there's a doctor on every flight? And they're like, no, no, no, we just, you know, see if there's...

James Munns: see if there happens to be a doctor on the flight.

Amos Wenger: Yeah, exactly. Like in the movies, they're like, sir, there's a protocol, we're just following it. Okay. Deaths, injuries, financial: those are the three tiers of failures. No, of the consequences.

James Munns: Exactly.

Amos Wenger: Yeah.

James Munns: Simplified a bit. But yeah, those are the three things that we're usually thinking about. And when we know that failure is statistical, and when we know that there's a price to failure, now we have to reason about failure as an engineering problem. How do we decide what the acceptable limits are, exactly that of, how many one-in-a-million, one-in-a-billion cases do we allow? Because there's always going to be some, because we say everything could fail. It's possible that all four of those engines just go out at the same time. It is possible that that could happen.

Amos Wenger: But there's not just one lever you can action, right? Because okay, you can have the fanciest pieces and the fanciest parts and make it as resilient as possible. But you also have like maintenance. So you can schedule maintenance at different intervals and

James Munns: Yeah, and in safety critical, when you're talking about the safety of a system, all of that goes into it. That's why there's no, like, "this is a safe engine," right? It is a safe engine if it is integrated the way that it's supposed to be, operated in the range that it's supposed to be, and maintained in the way that it is supposed to be.

Amos Wenger: So if someone were to buy an airline and strip mine it, that would be very bad.

James Munns: It would be.

Amos Wenger: Good thing that never happens routinely.

James Munns: Never ever. But the thing is, aviation is probably one of the safest industries because they are typically so strict about that. At least in the US, if the FAA doesn't like what you're doing, they have the power to just say, okay, you can't fly. Both to a manufacturer like Boeing, where they could just say, okay, you have 48 hours to decommission all of these planes, and you can't take off with those planes again until that's resolved, or to an airline that hasn't kept up maintenance: that airline is not allowed to fly. They can end you like almost no other regulator in the US can.

Amos Wenger: So you're talking about the time when the US still had air traffic controllers.

James Munns: Yeah. Yeah. Yeah.

Amos Wenger: Sorry.

James Munns: You're gonna make me sad.

Amos Wenger: I know. It's true.

Address risk the best way we know how to

James Munns: Let's talk about engineering for now at least, because that's what I can still do something about: engineering. Yeah. And so the whole point of this is to make engineering decisions about, hey, what is our expected level of failure? How can we reason about that, decide whether that is reasonable or not, set some standard about what is reasonable and not, and then, as often as possible, reduce that, where we go, okay, if you are going to require this level of reliability, you have to be doing these things to keep it that reliable. And at the end of the day, we're just trying to address risk the best way we know how, because this isn't an ideal system. It's an engineering, statistical system, which means there's no right answer. There is just the best way we know how to today. This is tailored industry to industry. What you do in automotive is very different than what you do in avionics. Because if you have a problem with your engine in a car, you can pull over. And if you have a problem with your engine in an airplane, you can't pull over as easily. And if a plane crashes, there are hundreds of people on board, versus if a car crashes, there are single digits often, or if it's a bus, maybe double digits, but much less than a plane.

So the level at which you set certain thresholds is going to be different industry to industry. And also just the concerns that they have: automotive has more parts from more vendors, so their safety standard takes into account how you deal with having tons of different vendors for all the different pieces of the engine, whereas there are relatively fewer parts in a plane from a fewer number of suppliers and things like that. So these different industries will have customizations to this approach of functional safety that are more tuned to their industry. That's why there are different standards for all of these industries. There's no one safety critical standard. They all borrow 95% of how you approach safety statistically, at least at the base layer like this. But what they expect you to do, and functional safety in general, is common to most of them. A lot of them actually just derive from the same standard. IEC 61508 is the definition of functional safety. It's typically used for industrial, but even automotive has a standard called ISO 26262, which is basically a themed version of 61508 for automotive.

Aviation is a little different because it actually predates 61508, but 90% of what they ask for, the gist is the same. The forms they ask you to fill out are different, but the kind of things that they're asking you to check against are the same concerns. So as you can imagine, functional safety is very process oriented. It is all like you were saying, "Oh, how many forms do you have to fill out?" Or what kind of check boxes do you have? It is very, very process oriented because generally we've realized in engineering, this is the best way to get reliable results out when you have a reliable process of doing things. It has defined ways of saying what you're going to do and how you're going to do it. So you specify both what you're building and how you're going to build it. You make sure that you do what you say you are going to do. So you implement code and you implement process, and then you make sure that you did it the way that you said you were going to do it. So that if someone goes and checks the way that you're doing things, it actually matches reality of what you did.

Amos Wenger: And that's like the requirement tracking that you were talking about in another episode.

James Munns: Exactly, we talked about traceability. And then verification is proving that you did what you said you were going to do. And in that episode where we talked about traceability, that is one way that you make sure that all of that chain is unbroken. That you make sure that if you ever changed what you said you were going to do, you also made sure that reality got updated to match what you were planning to do. So traceability is one tool of making sure these are all connected top to bottom.
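A hypothetical sketch of what that chain can look like in code (the requirement ID and function below are invented for illustration): each requirement gets an identifier in the spec, and the tests that verify it reference that identifier, so a tool, or even a grep, can check that nothing is left uncovered and that a changed requirement points you back at the tests to revisit.

```rust
/// REQ-042 (hypothetical): the controller shall command zero output torque
/// whenever measured speed exceeds 6000 RPM.
pub fn overspeed_cutoff(speed_rpm: u32, requested_torque: i32) -> i32 {
    if speed_rpm > 6000 { 0 } else { requested_torque }
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Verifies: REQ-042
    #[test]
    fn torque_is_cut_above_limit() {
        assert_eq!(overspeed_cutoff(6001, 150), 0);
        assert_eq!(overspeed_cutoff(5999, 150), 150);
    }
}
```

Run with `cargo test`; the point is less the code than the fact that the link from requirement to implementation to test is written down and checkable.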

Amos Wenger: So how suited is functional safety to a couple with trust issues?

James Munns: It may not be the most effective use of your efforts. I think there might be other things that you could do that would be more efficient.

Amos Wenger: It's a neurotypical couple. So I need clear requirements from you. Hey, communication. Like what traceability methods are you going to employ?

A paper trail...

James Munns: This is kind of what I was getting into with the traceability stuff too, though, is like you can do 20 percent of the formalism and get 80 percent of the value. It's like having checklists and having process and writing stuff down and making sure that you keep that up to date is a good thing. And surprise, I'll get to this where this pops out later. And it's not your relationship. But the point is to make a paper trail so that we can figure out what you did. And if we realize much later that, oh, hey, there's something potentially wrong, we can go back and figure out who is and isn't affected, like who is following the same assumptions and what is affected or what isn't affected.

Or if you have multiple products and you find a bug in one of them, we can go back and figure out if all of these products have that same like root cause or commonality to them and things like that. The real goal is to catch issues before they become a failure, because if you catch things before a crash happens, that's a win like you caught it. It got fixed, whether that was when you were in development or after you shipped before it went out, like even in those cases where sometimes you'll have like, oh, the engine went out, but we managed to land back. But then we were able to do analysis on the engine and figure out why it went wrong. And then we can go back and go, oh, we have the same issue in all of our process or something like that and catch those things before they fall through.

In safety critical, we talk about the Swiss cheese model, where for something to really fail, a lot of things have to go wrong, because you're doing all of these things to make sure that you're doing things right. For something to really go wrong, it had to be missed in requirements, in implementation, in testing, in usage, and no other fail safe caught it. It's called the Swiss cheese model because it's like stacking up pieces of Swiss cheese with holes in them, and for an actual failure to happen, you have to be able to go all the way from the top through to the table. So you're basically giving yourself the most possibilities to catch things before there's actually a line of holes through the whole stack of cheese.
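The Swiss cheese picture has a direct numeric reading. With invented numbers (each layer's miss rate below is made up), a defect only becomes an accident if it slips through every layer, so the residual risk is the product of the individual miss rates:

```rust
// Swiss cheese model, numerically: multiply the per-layer escape probabilities.
fn main() {
    // Hypothetical probability that each layer fails to catch a given defect.
    let layers = [
        ("requirements review", 0.10_f64),
        ("code review", 0.20),
        ("testing", 0.05),
        ("runtime fail-safe", 0.01),
    ];

    // P(defect slips through every layer) = product of the misses.
    let p_slips_through: f64 = layers.iter().map(|(_, p)| *p).product();
    println!("P(defect becomes an accident) ~ {:e}", p_slips_through); // ~1e-5
}
```

The multiplication assumes the layers miss independently; a systematic issue, like a wrong requirement, can line the holes up.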

Amos Wenger: So the amount of paperwork is by design. It's like redundancy also in the process.

James Munns: The things you have to do and why you do them are there for a reason. How verbose it has to be, and how much knowledge you have to have to do it... they just haven't figured out a more efficient way to do it. You know what I mean? The formalism is the most effective way they've known how. That doesn't mean it's perfect. But yeah, the gist is there.

Amos Wenger: In the case where the plane had an engine go out and it was able to recover and still go all the way to the destination or something, you would call that an issue and not a failure? This is just because I thought we would catch issues in verification, because I'm naive and I never worked in the industry.

James Munns: Well, we said that a failure is a loss of life, injuries, or equipment damage. So an engine going out... that would have been a failure, because it caused damage to equipment, but we would have treated it as a low one. That's not as big a deal. It is a failure, because the engine went out, and then we have to decide: was that engine going out just, hey, we said it's one in a million, and this was the one in a million? You know, there was a crack in the engine blade. They can just have a crack after a million hours and it can happen.

Amos Wenger: I happen to have the IEC 61508 Wikipedia page open and they define some words that are very fun. Frequent is many times in lifetime. So speaking of relationships, if you're having frequent intercourse, you know what it means now.

James Munns: It's a couple times in your lifetime.

Amos Wenger: Well, it's failures per year. I don't know if that's what you want to call it. Probable is several times in lifetime. Occasional is once in lifetime. Remote is unlikely in lifetime. Improbable is very unlikely to occur. Incredible is: cannot believe that it could occur. And then there are four consequence categories: catastrophic, which is multiple loss of life; critical, loss of a single life; marginal, major injuries to one or more persons; and negligible, minor injuries at worst. So it wouldn't even be negligible if there was an engine failure but they still managed to land where they were going anyway.

James Munns: Exactly. Chances are people on the plane wouldn't even know. And that's the system working, which is interesting: even though a component failed, if that makes sense.

Amos Wenger: Yeah. That makes sense.
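To show how those two axes get combined in practice, here is a toy risk matrix in code. This is not the actual IEC 61508 mapping (that text is paywalled, and the real standard is considerably more nuanced); the scoring below is invented purely to show the shape of the idea: rarer and milder gets less rigor, frequent and catastrophic gets the most.

```rust
// Toy risk matrix: frequency and consequence combine into a required level of
// rigor. The categories mirror the ones read out above; the numbers are made up.
#[allow(dead_code)]
enum Frequency { Frequent, Probable, Occasional, Remote, Improbable, Incredible }

#[allow(dead_code)]
enum Consequence { Catastrophic, Critical, Marginal, Negligible }

/// Higher number = more rigor required (hypothetical 0..=6 scale).
fn required_rigor(f: Frequency, c: Consequence) -> u8 {
    let f_score = match f {
        Frequency::Frequent => 5,
        Frequency::Probable => 4,
        Frequency::Occasional => 3,
        Frequency::Remote => 2,
        Frequency::Improbable => 1,
        Frequency::Incredible => 0,
    };
    let c_score = match c {
        Consequence::Catastrophic => 3,
        Consequence::Critical => 2,
        Consequence::Marginal => 1,
        Consequence::Negligible => 0,
    };
    // Toy combination rule: add the scores and cap the result.
    (f_score + c_score).min(6)
}

fn main() {
    // "Engine shut down, but the flight landed normally": remote and, for the
    // passengers, negligible, so it sits near the bottom of the matrix.
    let rigor = required_rigor(Frequency::Remote, Consequence::Negligible);
    println!("required rigor class: {rigor}"); // 2
}
```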

Everything can fail, especially people

James Munns: So the reason we want to give ourselves all that margin is because everything can fail. Like an engine blade can fail, but also the person whose job it was to review that you did it right could have missed something. They didn't have enough coffee that day. They didn't sleep well. They missed something that they had an opportunity to catch. So we give ourselves all this margin.

Amos Wenger: They used an LLM to review the thing.

James Munns: Yeah. Another thing is that only really end products are safety certified.

So in an airplane, it might be something like a weather radar. You will have a safety certified weather radar; that is a whole component, a whole item. And this is one of those blurry things where it depends on the industry and how it's integrated. So you might ship a safety qualified thing; it still needs to be integrated correctly into the rest of the airplane. But your real time operating system is not really something that just is safety certified, because it's just a component of a larger system. And context matters.

Amos Wenger: As you explained the definition of this, I'm having a special thought for the people who installed the AC in my apartment, and it is blowing cold air directly out the door of the room. Oh, nice. Therefore, it is completely inefficient if you have the door even slightly open, which you need to if you have cats. So, yes, I know what you mean, James. It's not critical for sure.

James Munns: Yeah, but this is one of those areas where context matters. It's only as good as what they bounded it to be. And when we have things like libraries or tools, they're just a piece of the process along the way. And when we have a programming language or a compiler, those are just tools. Those are a way of achieving the system that we are building, which means they can't by themselves really be safe. But there are safe ways that you can use them to build something on top of that.

What you can do is you can spend a lot of effort and design your library, your real time operating system, your compiler, your language, up to a safety standard. So you can say it has checked all the boxes that you need to. We have documents and requirements and traceability that say, we've conformed up to the standard for this piece, which means then if you're building a larger system, you can make a very compelling case like, hey, we're using tools and components that were up to the standard. So even though we didn't do it, it was done up to a standard, which means that we can confidently use it versus if you just found something on the street and integrated it, you have no idea that that's up to a standard or not.

Amos Wenger: Yeah, I guess that's something that's difficult to conceptualize for people who don't do engineering. But that's also why you don't need to test every single copy of a component that you make, especially software, I guess. But you design it, you test a sample. It's all probability. Not everything you use has been tested for thousands of hours, right? Not the copy you have.

James Munns: And for these really critical industries, especially like government ones, like if you buy bolts from someone, you might destructively test some percentage of those bolts to make sure that the bolts match the spec from the vendor. And then over time, you might sample less of them. But then if you ever find one that fails, you might massively increase. So you might be testing to failure 1% or 10% of your bolts, depending on how critical it is.

Amos Wenger: This is exactly how self-checkout works in one of the stores that I go to. You can self-scan everything, and then they will randomly check you. And if they ever catch you with an article you haven't scanned in your cart, then they will check you a lot more. So it's the same idea. It's statistical.

James Munns: This is statistical sampling. It's exactly the same thing as we have sampling-based profilers and things like that. You hope that you make enough statistically relevant samples and you get reasonable data out, or at least enough data to go on. And then you have to decide how much data is enough data to go on.
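A sketch of that "sample less until something fails" policy, with invented rates and thresholds (real acceptance-sampling plans come from standards and statistics tables, not from a constant like 0.99):

```rust
// Adaptive acceptance sampling, illustratively: decay the sampling rate while
// parts keep passing, and jump it back up as soon as a sampled part fails.
struct AcceptanceSampler {
    rate: f64, // fraction of incoming parts to destructively test
}

impl AcceptanceSampler {
    fn new() -> Self {
        Self { rate: 0.10 } // start by testing a hypothetical 10% of parts
    }

    fn record_result(&mut self, sampled_part_failed: bool) {
        if sampled_part_failed {
            // A failure in the sample: massively increase scrutiny.
            self.rate = 0.50;
        } else {
            // A long streak of good parts: relax toward a 1% floor.
            self.rate = (self.rate * 0.99).max(0.01);
        }
    }
}

fn main() {
    let mut sampler = AcceptanceSampler::new();
    for _ in 0..200 {
        sampler.record_result(false);
    }
    println!("rate after 200 good samples: {:.3}", sampler.rate); // ~0.013
    sampler.record_result(true);
    println!("rate after one bad sample:   {:.3}", sampler.rate); // 0.500
}
```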

Amos Wenger: But it's probably terrifying to folks who cannot... I don't know. Part of me is like that. Part of me is like, "No, we need zero failure, not just the small number. We can't rely on maths for this." But there is no zero failure. I mean, I'm divorced, so I know.

"Just add safety" afterwards?

James Munns: Yep. At the end of the day, everything is statistical. I know. And this is why just adding safety afterwards, especially for any non-trivial component, like a compiler or an operating system or something like that, going back and trying to make a safety case for this or putting together that paperwork so someone can use it is often so difficult that it's just easier to start over and do it the right way from scratch.

Amos Wenger: Well, explain that to folks using C++. This is all I'm thinking about. Do you have something else in mind with that slide?

James Munns: What do you mean?

Amos Wenger: Just add safety. For me, it's like all the papers around, "Oh, we can add safety to C++," but actually not really. And starting over is like Rust or some other languages.

James Munns: Yeah. I mean, what I was going to say is just components, because this is one of those like, "Hey, why can't I just use some library as part of this? It's good, I've been using it for a while, why can't I just say that?" It's because you can't make reasonable statistical assumptions about something that... Because what we've said in the process is: if we don't know that you've developed it this way, then we can't say that it fails a normal amount for software developed in this way. And it throws off the other math that we're doing to decide, "Is this system good enough?" It throws all of those numbers off, or it just makes it an unsolvable equation.

Amos Wenger: But for that, like for your whole homegrown network stack that you're making, do you take that into account? Do you like not use certain libraries because they haven't been developed the right way?

James Munns: No. No, because I'm not developing for safety critical; it's not safety critical. It's one of those things where I might do some of the detailed design work that might leave some breadcrumbs that might make it possible later. But you wouldn't, because even though all the stuff that I'm saying about functional safety is good, and I think it's a good way of doing it, it is overly burdensome for most applications. If the cost of failure is low enough...

Amos Wenger: If you need a light switch that's remotely controlled or whatever, that's... Yeah, you don't need all the safety critical stuff. Because the failure is you don't have light switches.

James Munns: Because that's the other thing, is you have to say what could go wrong. And if the cost... If anyone wanted to use my network stack in safety critical, then yes, I would have to say that. But right now, it's research. So like...

Amos Wenger: It's a good tip about anxiety in general. Asking yourself the question: what could actually go wrong? I mean, it's not gonna immediately make your body go calm. But sometimes for me, it helps to run through the actual scenarios. Especially things in relation with cars. I do that too. I keep reminding myself cars are made for idiots. They're trying to avoid lawsuits. They have standards. The failure modes are not that terrible. And if they are, then it's someone else's problem, because I'm not here to deal with it anymore. So that helps me a bunch.

James Munns: I mean, that's one of those things that's in this formalism: there's an FMEA, a failure modes and effects analysis, which is basically where you sit down and you brainstorm all the possible ways anything could go wrong at any layer. And then you game it out, usually in a tree, basically: if this fails and this fails and this fails, how many steps do I have to go to get to death? Or, does it ever terminate in death? And then you work backwards. And if you go, oh, if this one resistor fails: death, then you go, well, then that resistor needs to not fail more often than this, because that becomes like the limiting reactant for my entire system.
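A minimal fault-tree sketch in that spirit: basic events with assumed (invented) failure probabilities, combined through AND gates (all inputs must fail) and OR gates (any input failing is enough). The component names and numbers are made up for illustration.

```rust
// Minimal fault tree: compute the top-event probability from basic events,
// assuming the basic events fail independently.
enum Fault {
    /// A basic event with an assumed failure probability.
    Basic(f64),
    /// All children must fail (e.g. primary AND backup both fail).
    And(Vec<Fault>),
    /// Any child failing is enough (e.g. a single point of failure).
    Or(Vec<Fault>),
}

impl Fault {
    fn probability(&self) -> f64 {
        match self {
            Fault::Basic(p_fail) => *p_fail,
            Fault::And(children) => children.iter().map(|c| c.probability()).product(),
            Fault::Or(children) => {
                // P(any fails) = 1 - P(none fail)
                1.0 - children.iter().map(|c| 1.0 - c.probability()).product::<f64>()
            }
        }
    }
}

fn main() {
    // Hypothetical "loss of braking": either the shared power supply fails
    // (a single point of failure), or both controllers fail.
    let top_event = Fault::Or(vec![
        Fault::Basic(1.0e-5), // shared power supply
        Fault::And(vec![
            Fault::Basic(1.0e-3), // primary controller
            Fault::Basic(1.0e-3), // backup controller
        ]),
    ]);
    // Dominated by the single point of failure: ~1.1e-5.
    println!("P(loss of braking) ~ {:e}", top_event.probability());
}
```

Working backwards from a number like that is the "limiting reactant" observation: the single point of failure sets the floor, so that is where redundancy or a better component buys the most.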

Amos Wenger: It's like coz, the causal profiler, where if you want to find how to make your program go faster, you can instead make some components go slower, and then it tells you what actually benefits the most from being optimized. Interesting. And it's kind of the same thing of identifying causality. Like this is the thing that we should invest our safety budget into, like, I don't know, or equipment or parts, whatever. Exactly. I don't have the words to talk about this, but you get what I mean.

James Munns: Yeah. But it's also like a neat approach of analysis, because you might realize, oh, I have a fail safe, so I don't have to worry about it. But then you go through this analysis and you realize that one component could knock out both your primary and your secondary, and you go, oh, I guess I don't. And then you have a SPOF, a single point of failure. And it's a balance too, where you go, there's no point in having this redundancy, because it doesn't get me anything, because it only prevents from here downwards, and that's so unlikely that it doesn't make sense. So a lot of designing for safety is actually throwing out anything you possibly can, because the fewer components there are to fail that don't need to be there, the easier it is to prove that it does work, and things like that.

Amos Wenger: So I guess that's where having nice requirements comes in, because you can actually point back, you know, why do we even have this? And then you can go back and see exactly why you do, and see if you can actually get rid of it. Exactly.

Working backwards to prove safety...

James Munns: And yeah, so working backwards is almost always harder than just rewriting, for any non-trivial thing. True. But the Rust compiler is a fairly rare exception. As far as off-the-shelf software goes, it was a software project that a ton of diligence was done on along the way. There was documentation of what it was supposed to do and what it was not supposed to do, documentation of the decisions that were made, when they were made, who they were made by, who they were checked by. There were reviews of code and decisions. There's been continuous integration testing that's automated, running continuously since it's existed. This has all been in writing, in public, with a linear git history that you can go back and look at, with the decisions all the way back to before 1.0.

If that sounds like everything that I've been talking about for functional safety so far, it is. And Rust didn't do this because it was trying to aim for safety critical. It was doing this because it's good engineering. Yeah. And you just do these things for things that are important to get right. Mozilla realized that, from like a more safety perspective, or just a good engineering culture perspective. So they had this culture of setting the project up like that, and it actually laid a huge amount of the breadcrumbs. And this is one of those things that I realized when we were having those initial discussions at Ferrous of how reasonable is it to have a safety critical Rust compiler. And I go, look, the Rust project is checking so many of the boxes that they're probably doing this better than a lot of folks who are doing safety critical stuff, or at least they have a more cohesive approach to doing these kinds of things. And it's why it was so quick. There were like two years of convincing people that it was a good idea to do. And then once people got on board, it took them about two years from when I left to when they had the first qualified version of the compiler. And for an off-the-shelf open source component to go from nothing to safety qualified in that time is bonkers quick in the safety critical industry.

And it's because Rust had a lot of this safety case already made, which meant you didn't have to go back and rewrite huge chunks of it; you had to define how it should be used, and then make sure that you had all of your justification in one place, in a way that is palatable for where it was going to be used. And it wasn't easy: Ferrous, and I'm sure the other two compiler vendors, put a ton of effort into it. But it became viable because so much of that had been done the right way along the way. That level of formalism and just doing best practices is really, at the end of the day, all functional safety is: it's just a formalized way of requiring you to do what is considered the best practices. And you do them either because it's required to ship a product in that industry, or because you go, well, this is the best way that we know how to do these things so far. And someday, if there's an easier, faster way to do it, that'll get switched to, too. But today, just considering all of the potential failures and that everything fails, this is generally just the best way they know how.

Amos Wenger: So for folks who do not benefit directly from safety critical stuff, a good takeaway from this episode is that the Rust compiler project has been held and continues to be held to higher standards than like most open source projects. Otherwise, this effort wouldn't have been possible. Is that correct?

James Munns: Yeah, I definitely agree. It's definitely not the only one up there. But it's definitely one that did enough of the right things along the way that it became really seriously viable for that. The other thing is, I also have a lot of folks ask, like, "Hey, Rust is better than C, right? Why isn't everyone immediately switching to Rust?" Especially now, because it's safety critical, why isn't everyone just switching immediately? And the answer is: these industries just move slow. And one of those things is, we've already said every component can fail. So even if C has failure modes, that's actually less of a big deal than you might think.

We can mitigate known failure modes

James Munns: Because if we know all of the failure modes, we can just say: don't do that. That's what things like MISRA standards are. We go, there are known deficiencies in the language, or the compilers that implement the language, and things like that. So we'll just set a checklist of rules that say: you must make sure you never do this, you never do this, you always do this. And you either have a machine validate that it's always true, like a static analyzer, or you have a human check it, or you set up a coding standard that says you never use triple function pointers or something like that, because they're so easy to get wrong that you avoid them.

Amos Wenger: Is the M in MISRA for "mitigate"?

James Munns: I don't think so. It's some automotive standard; I have no idea what MISRA stands for off the top of my head. That's the thing with new things: you can't mitigate unknown failures. You can only mitigate known failures. And as many nice things as I've said about Rust, it will have failures, and we'll find soundness issues, or issues with the compiler, or with the ways it's integrated, or certain issues on certain platforms. You can't generally mitigate unknown failure modes, which means there's always sort of a risk-reward tradeoff of: hey, even if this is way better, there's some risk of switching to it, because it will remove a bunch of failure modes, but it will add in some unknown failure modes that are harder to quantify until we've found them.
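The "a machine validates the rule" idea has a direct analogue in Rust tooling. The lints below are real rustc and Clippy lints, but the particular selection is just an illustration of mechanical rule enforcement, not a claim that these satisfy MISRA or any other standard:

```rust
// Crate-wide lints turn whole categories of constructs into build errors, so a
// reviewer doesn't have to spot them by eye.
#![forbid(unsafe_code)]            // no unsafe blocks anywhere in this crate
#![deny(clippy::unwrap_used)]      // no .unwrap(): handle the error path
#![deny(clippy::indexing_slicing)] // no panicking `v[i]`: use .get(i) instead

fn checked_lookup(values: &[u32], index: usize) -> Option<u32> {
    // `values[index]` would trip the lint above; .get() forces the
    // out-of-bounds case to be handled explicitly.
    values.get(index).copied()
}

fn main() {
    let speeds = [10_u32, 20, 30];
    match checked_lookup(&speeds, 7) {
        Some(v) => println!("value: {v}"),
        None => println!("index out of range, handled explicitly"),
    }
}
```

The `unsafe_code` lint is enforced by rustc itself; the Clippy ones fire under `cargo clippy`, which matches the "a static analyzer checks the coding standard" case.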

Amos Wenger: That's something anyone who's tried to drive Rust adoption at a company can relate to.

James Munns: Yeah, yeah. And it's all risk. That's a good engineering thing to be considering: balancing risk and reward. But yeah, in safety critical, the wheels move ever so slowly,

Amos Wenger: but they do move. Speaking of wheels, MISRA stands for the Motor Industry Software Reliability Association.

James Munns: There you go. Yeah. And back in the day, C was a huge improvement over assembly. There was a long time where a lot of anything embedded would have been written in assembly and things like that. And having structured programming like C has was maybe even a bigger step up. The gap between assembly and C is probably larger, in safety terms, than the gap from C to Rust. I definitely think there's benefit in that second step, but C was the newer, safer alternative when a lot of these standards were being written, where they'd say you need to use a language with structured data and structured control flow and not assembly, because it was new, but it was a marked improvement at getting things right more often. Yeah. Safety critical is one of my areas where I will talk anyone's ear off anytime. And this is just barely scratching the surface of it.

Amos Wenger: But yeah, we'll have other episodes about it. I'm sure someone else is bound to ask a question. We'll have followups, I'm sure.

James Munns: Yeah. Please do ask questions. This is one of those areas where I want questions, because it's so niche. There are so many more people who have heard of it but never worked in this area. So they've heard things like, oh, you have to do all this paperwork, you have to do all this, it has no value, whatever, but they've never worked in those areas and they don't really get why you would do that. Or they've only worked on the periphery of those industries, where they weren't the ones coming up with the plans. They don't know why the plan is like that. They just saw a stack of paperwork that they had to do and didn't understand why you were doing that. All of my excitement about this is on sort of the theoretical intent of functional safety.

In practice, it's going to suck more than the intent, in the same way that there's the scientific method for scientific research, and we say this is the best way we know how to do science: follow the scientific method. But then you have authors out there who are like, well, I have to get published, and so they're p-hacking the hell out of their studies, or finding just the right way to arrange the data where it looks significant, so that they can publish and put stuff out. You can do the exact same thing with functional safety. You can go, ah, I've checked all the boxes, I'm allowed to sell it now. But you just checked the boxes and you've kind of missed the whole point of it. It's like that in any field. But I definitely think if you just check the box, it's not really... it's just a lot of overhead.

Functional safety is imperfect, but it's useful

James Munns: There's very little value unless you like, take it to heart why you are doing these things. But if you really do approach it with that mindset, especially with like the knowledge of how bad it can get if you don't, then it's still an imperfect system, but it's still a useful system.

Amos Wenger: And it's continuous, because even once you ship a product and it's integrated somewhere, you still want feedback from your customers, right, to integrate into the next versions, and for maintenance, recalls, whatever you need to do.

James Munns: Yeah, it's something that doesn't end. There are new versions of all these standards regularly. And you get to pay lots of Swiss francs for the newest versions of all these from international standards organizations. But the core concepts have been the same for a very long time. They are always changing, though, like, hey, you know, MISRA is starting to come up with rules for Rust, where they say, okay, well, the MISRA rules were there, here's how they apply to Rust. There was some informal analysis a while back, but now MISRA actually has an addendum document to their most recent standard that says: of all the rules, these ones do apply to Rust, these ones only apply to unsafe Rust, those kinds of things. They are actually taking that feedback into account because they realize the industry is moving in that direction.

Amos Wenger: That's hilarious, because there's a MISRA Rust repository on GitHub from PolySync. And they're saying that because of the proprietary nature of the MISRA C specification, the description of each rule has been omitted. So you get the number, the identifier of the rule, and whether it applies to Rust, like whether you get it for free in Rust or not, but you don't get to know what the rule is.

James Munns: Yeah, this is one of those things; 61508 is like that too. There is no public version, there's no legally public version of 61508. The fact that you pay hundreds of euros per document goes towards the people who are writing and maintaining that document, which, on one hand, I guess you have to figure out how to fund those industries somehow. On the other hand, it's also very frustrating that you can't look at all of this in public all of the time. Even if I had citations for a lot of the stuff on here, if you didn't pay 600 euros to get that volume of the standard, you wouldn't be able to see that citation. And the MISRA Rust stuff is like that, where they can say, well, look, there are however many items and we can just say whether each applies or not, but we can't reproduce the text of the MISRA standard because it's a paid standard. That's a whole conversation for another day. And I don't love that aspect of functional safety. Having worked at companies where I had access to the whole volume of standards, or having paid hundreds or thousands of euros to get access to those standards, that part sucks. And I don't know a way around it, but that's a whole other topic.

Episode Sponsor

This episode is sponsored by Depot: the build acceleration platform that's on a mission to make all builds near instant. If you're tired of watching your builds in GitHub Actions crawl like the modern-day equivalent of paint drying, give Depot's GitHub Actions runners a try. They’re up to 10x faster, with unlimited concurrency, faster caching, support for Linux, macOS, and Windows, and they plug right into other Depot optimizations like accelerated container image builds and remote caching for Bazel, Turborepo, Gradle, and more.

Depot was built by developers who were tired of wasting time waiting on builds instead of shipping. It's made for teams that want to move faster and stay focused on what actually matters.

That’s why companies like PostHog use Depot to cut build times from over 3 hours to just 3 minutes, saving tens of thousands of build hours every week.

Start your free 7-day trial at depot.dev and let them know we sent you.