134. How much is too much risk

Today on the show, we are discussing risk and its role in any company. The conversation contemplates the balance of risk versus safety and how this push and pull can be managed in the best possible way. One of the main takeaways is that more risk is much better suited to earlier-stage startups than to big, established companies with a large customer or client base. In the conversation, we cover the different factors that come into these considerations, including the length of outage windows, cowboy development mentalities and phased rollouts. We also get into why risk is so important in the ambitions of any project or company and when it might be most appropriate to accept more risk. For all this and a bunch more on a central part of the development game, join us on The Rabbit Hole, today!

Key Points From This Episode:

A broad definition of risk in engineering and development.
The benefits of taking on product risk early on in a business lifecycle.
Shortening the outage window and allowing for more risk.
The industries that do not allow for downtime.
'YOLO' deployment and high-risk-taking environments.
The risk versus reward equation; aiming to blow up requires risk-taking.
Weighing the enjoyment of high-risk companies against stricter environments.
The support of an organization that has effective ways to deal with production problems.
Phased rollouts and mitigating risk across sections of users.
Why risk is important in the success of a company.

Transcript for Episode 134. How much is too much risk

[INTRODUCTION]

[0:00:01.9] MN: Hello and welcome to The Rabbit Hole, the definitive developer’s podcast in fantabulous Chelsea, Manhattan. I’m your host, Michael Nunez. Our producer today —

[0:00:10.1] WJ: William Jeffries.

[0:00:11.3] MN: Today, we’ll be talking about how much is too much risk for your organization or product. We’ll talk about what is risk, why you would want to have it or maybe not and then some examples we see in the field, about risk. You know, we’ve been in many different clients, so it’s fair for us to have seen a lot of different –

[0:00:30.2] WJ: Yeah, organizations do it differently. I mean, different companies have different levels of tolerance.

[0:00:34.6] MN: Yeah. Well, let’s define it, right? What is risk?

[0:00:38.4] WJ: I mean, I think when it comes to engineering, what we’re really talking about is the risk that you fuck up your product in a way that makes you lose money or pisses off your customers. You know, maybe that’s a production outage, maybe that’s a feature that goes away, maybe that’s like you screw up your accounting and all your numbers are wrong. There are different degrees to which you can thoroughly screw up your software.

[0:01:00.0] MN: Yeah. You would want to identify based on your users whether they’re okay with production being out for 20 minutes and while you’re fixing something or whether that’s really bad. Whether you can alleviate some of that risk by rolling back what you just introduced to the customer that may have been a breaking change, like that kind of stuff.

[0:01:19.8] WJ: Right.

[0:01:20.7] MN: Well, that is like not calculated like by hand or with the machine but that all plays a part in how much risk your product has. Why would you want that risk in the first place, I guess is the question?

[0:01:35.3] WJ: I mean, I think that taking on some product risk in order to ship faster is a good idea for earlier stage companies where you have less to lose. You know, if you’re not really making very much revenue anyway, then shipping out features quickly, so that you can start to find product market fit and come up with something that actually does generate revenue, that’s probably worth it.

[0:02:00.7] MN: Yeah.

[0:02:00.9] WJ: Right? Whereas, if you are a bank or a Fortune 500 company and you're talking about bringing down major infrastructure that’s going to cause a lot of negative press, it’s going to affect your stock price. You have a lot more to lose.

[0:02:15.9] MN: Yeah, you don’t want to deal with that, having outages in that regard. I guess a question I have is, yeah, I think it ultimately is up to the user and the product, right? Like suppose you worked at a website that deals with, like, publications, like newspapers and stuff like that.

Will your customers be upset if the website is down at three in the morning? Well, it depends on how many customers are actually reading articles at three in the morning. If it’s the New York Times and you have a lot of users, then you may not want to disrupt the user by having that risk; you want to have the website up, you know, as much as possible.

[0:02:53.6] WJ: Well, in some of these, you know, media companies, they make money on the big traffic spikes. It’s like, if you have an article that goes viral, you need all your ads to load for that one article, that’s the most important part of the site, right? I mean, if you have like your comments feature breaks, that could be down for days and not really affect your revenue but if you have that one viral article go down for the one hour when you could have had 15 million views, that’s like a way more expensive proposition.

[0:03:24.8] MN: Right. What are some ways that how can one or one organization, I guess, have more risk so that they could then find like the product that the users actually want and stuff like that? Because to me, the first thing I think of is like, your deployment game has to be crazy. Being able to ship things and roll back whenever necessary has to be –

[0:03:51.0] WJ: Absolutely.

[0:03:51.2] MN: Has to be perfect. The more –

[0:03:53.2] WJ: That’s a great risk mitigation strategy.

[0:03:55.6] MN: Yeah, because the better your deployment and your rollback strategies are, the more likely you’re able to take risks and do those deployments to figure out what the user actually wants in the experience and stuff like that.

[0:04:09.5] WJ: I mean, I think like, also, how much you want to spend your engineering efforts toward mitigating risk has to do with your total amount of risk tolerance, right? If you're pre-launch, who cares? Shit, whatever.

[0:04:25.6] MN: Yeah, go ahead, just get it out there.

[0:04:27.8] WJ: You’re like in beta. Nobody’s paying, whatever, the customers know that shit’s going to break sometimes.

[0:04:33.0] MN: Right, yeah. You can just ship it, however you feel that kind of stuff but say, you have a product that’s not in beta, it’s like, useable live code.

[0:04:43.4] WJ: Let’s say that your product is making you a million dollars an hour. You bring that down, you better be able to get it back up really fast because if it’s down for an hour, you’re out of a million dollars.

[0:04:52.9] MN: Yeah, that’s a lot of money you don’t want to be out. The ability to shorten that window matters: the shorter you can make that window of outage, the more risk you can afford to take, because you’re able to do those deploys and roll back whenever necessary to shorten the amount of downtime, if we’re using one million per hour in this example. The shorter you can make it, the more money you can make while still taking on that risk in the first place.
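
The million-dollars-an-hour example can be put into rough numbers. The sketch below is only an illustration: it assumes revenue accrues evenly over time, and the function name `outage_cost` and the recovery times are hypothetical, not figures from the conversation.

```python
def outage_cost(revenue_per_hour: float, minutes_down: float) -> float:
    """Revenue lost during an outage, assuming revenue accrues evenly over time."""
    return revenue_per_hour * (minutes_down / 60.0)


# The "million dollars an hour" figure used in the conversation.
REVENUE_PER_HOUR = 1_000_000

# Hypothetical recovery times: a slow manual recovery versus faster rollbacks.
scenarios = [
    ("60-minute outage", 60),
    ("5-minute rollback", 5),
    ("30-second rollback", 0.5),
]

for label, minutes in scenarios:
    lost = outage_cost(REVENUE_PER_HOUR, minutes)
    print(f"{label}: ${lost:,.0f} lost")
```

Under these assumptions the same bug costs a full million with an hour-long recovery and under ten thousand dollars with a thirty-second rollback, which is the sense in which a shorter outage window buys room for more risk.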

[0:05:21.1] WJ: I think part of it is getting everybody in your company on the same page about how much risk you are able to take. Because I mean, if your engineering department has a very high appetite for risk and your business side does not, you’re going to have some problems.

[0:05:36.2] MN: Right, there definitely has to be some cross communication as to how much risk is tolerable for the engineers and the business themselves.

[0:05:45.0] WJ: And if Bobby on your team has a really high tolerance for risk and Bobby’s manager does not, that can cause some problems.

[0:05:51.5] MN: Yeah, exactly. Bobby Cowboy over here may want to ship all the time; we’ve got to tell him to cool down, relax.

[0:05:59.1] WJ: Instant rollback, four times in the last 15 minutes, dude. Do you want to check locally first?

[0:06:08.1] MN: Well, I guess going into the topic of when don’t you want to have that risk, I guess is the question that I have, right? First off, Bobby should not have the ability to deploy to prod and he’s out there straight YOLO-ing when the entire organization thinks otherwise. But that’s a conversation from the organization to the cowboy coder, as I’m going to call that person.

There are going to be some industries where you can’t do that, I imagine, or you can’t deploy all the time; you have to make sure it’s right the first time that it’s done because, you know, the one thing that comes to mind is like hospitals. I imagine that software that keeps people alive, pacemakers or whatever have you, has to be correct –

[0:06:52.0] WJ: 100% of the time.

[0:06:53.9] MN: 100% of the time, yeah.

[0:06:54.5] WJ: Zero downtime.

[0:06:55.7] MN: I mean, a pacemaker’s the one thing; I don’t think that you can get over-the-air WiFi updates on a pacemaker, but like one of the website –

[0:07:02.5] WJ: You know, helicopter navigation. If you are flying an aircraft and you have a glitch that makes you lose control of the aircraft.

[0:07:11.4] MN: Yes, exactly. That, you want to make sure that that – there’s absolutely – you cannot have any risk when making changes for those things. Have you ever worked in an organization that’s like that?

[0:07:24.6] WJ: I mean, I’ve worked in big banks where there was a lot of compliance risk and if you did the wrong thing, the bank could get sued and there was a lot of security risk because people were trying really hard to hack the bank.

[0:07:36.3] MN: Yeah. Banks have a lot of money.

[0:07:38.9] WJ: Right. Very high value target for hacking.

[0:07:42.3] MN: Exactly.

[0:07:42.9] WJ: I’ve never worked on a medical product where someone could die.

[0:07:47.6] MN: No, I don’t think I have either and I’ve worked at a bank as well and very similar thing only you know, a handful of people. Only people who were compliant like, you had to take a course or something of that nature, were allowed to deploy to production and ensure that all the lines of code don’t have any memory leaks or where hackers can come in and make changes and what not, is the experience that I had.

It’s very like structured, there’s like almost like a political structure ensuring that everything is fine and good where the bank can’t get sued, everything is well and then you get to deploy. And it’s like, "Yay, we did a thing," it took us a year but that happens.

For good reason as you mentioned, right? You don’t want hackers taking all the monies from their customers. You want to make sure that the code doesn’t have breaking changes and what not. Let’s talk about the opposite side of the spectrum. Have you worked at a place that had a lot of risk, that just YOLO deployed? What was that? How is that different?

[0:08:57.5] WJ: I remember working on one client where we did trunk-based development and the code would deploy every time that you pushed to master, and everybody was committing directly to master and pushing. So they had to deploy all the time.

[0:09:07.6] MN: I miss Dave right now because I am sure Dave can look at his tattoos and tell us exactly what episode number that was.

[0:09:14.8] WJ: Right, yeah and I mean I have worked on projects where, you know, people were very casual about [inaudible] into prod and making changes in higher environments. I think that generally the main thing that I noticed in common between all of those companies is that they all had a lot to gain and not a lot to lose.

It was like the more established you are as a company, the more money you’re already making, the more customers you are worried about losing, the more risk averse you become. When you’re in the beginning, when you are young, when you are super startupy, before you have product market fit, it is worth it to take on the additional risk because like if you get it right, if you blow up, if you become the next Slack or Spotify or YouTube or whatever, and that is possible, right? That is a thing that is in these companies’ heads: if we can blow up and IPO, then it will all be worth it, and if we blow it, well, you know, we only have like a thousand customers, you know?

[0:10:26.5] MN: Yeah, it is only a thousand right now but.

[0:10:28.5] WJ: Or like whatever, we only have like you know –

[0:10:30.9] MN: A small amount.

[0:10:31.7] WJ: Like a $100,000 in revenue, monthly.

[0:10:35.9] MN: Yeah, I think I worked at a place that was more susceptible to risk because they wanted to make those changes as fast as possible, but those places definitely nailed down their deployment strategies so that they were also able to roll back, because the idea for them was: we have this product in mind, we are going to be the next Slack, Spotify, if you will, but we need to have this in place right now. So that in the future, we don’t have to worry about this either.

Like when we are really big and we are making a million dollars an hour, we already have this strategy that allows us to deploy and roll back as fast as possible. Because that is another thing that I have seen where companies might blow up but then their deployments aren’t as good as they could be. But they have already been gaining a lot of revenue, which is like a weird place you don’t want to be in.

So you would probably have to hire a stronger DevOps team to ensure that they can build a deployment process that fits the organization as it is now, rather than one that grew with the organization, because you have already grown too big.

Do you prefer the strict no-risk environment versus the risky?

[0:11:44.5] WJ: No it is so much more boring. The high risk tolerance like let us ship constantly is way more fun.

[0:11:49.6] MN: Oh no. I am the exact opposite, yeah no. Because like I don’t know, if I know my code is going to production, I get anxiety. Because it is like, “Oh I did all the test that I could but this is prod, like a lot of users.” A lot more users than what I just tested ever.

[0:12:08.6] WJ: So actually that is what I don’t like about the really low risk tolerance companies: it is still possible that you could break prod. It is just that there, it is a way bigger deal. At the high risk tolerance companies, people are like, “Well okay, I mean so you pushed a bug into prod but we’re going to fix it really quick and it will be fine.”

[0:12:29.8] MN: Yeah, I mean that is true. I think that whenever I deploy I am getting anxiety but if I can see that everything is A-okay as fast as possible then that’s great. When there is like low risk tolerance and everybody approved it –

[0:12:43.5] WJ: I think also like the amount of mitigation that they have done for the risk, like setting up instant rollbacks or having good automated test coverage or like generally having a well-tested codebase, you know, having good logging and monitoring in place so that you can easily figure out what is going wrong in prod. Dockerizing things so that it is easier to rebuild. If the service goes down, spin it back up.

[0:13:07.9] MN: Right, those things have to be in place though for you to feel comfortable when you’re feeling risky.

[0:13:14.0] WJ: I don’t know, some companies are like whatever, we’re going to do it. I don’t know. I don’t know what we’ll do if it breaks.

[0:13:20.6] MN: We’re going to figure it out, together.

[0:13:23.3] WJ: I guess we’ll learn what kind of mitigation strategies we need to adopt after things go to hell in a handbasket.

[0:13:29.0] MN: Exactly, yeah I think we definitely are on two different sides. You, William, feel a lot riskier than I do, which is great.

[0:13:38.1] WJ: I mean I think we are in agreement that both of us feel a lot more comfortable in environments where it is easy to deal with production problems.

[0:13:46.5] MN: Right, yeah because if the organization is not ready to deal with production problems then that’s also anxiety inducing already.

[0:13:54.4] WJ: Right, I mean that is going to be a bad time whether people in general are highly tolerant or highly intolerant of risk.

[0:14:00.7] MN: Yeah that is true, like if you know that what you are about to deploy requires a manual, hand-rolled deploy and it takes forever to roll back, that whole time you are sweating.

[0:14:12.0] WJ: You have like a two hour deploy process that is going to be really painful.

[0:14:15.1] MN: Oh yeah, you are sweating the whole time.

[0:14:17.6] WJ: Just knowing that there is a bug and it is hopefully going to be fixed in two hours but I mean you don’t even know, right? It’s like the fact that the fix could not actually work in prod and then you’ve wasted two hours waiting for the deploy to happen.

[0:14:34.0] MN: Yeah, oh man don’t do that. Don’t do that to your employees, people. It’s not –

[0:14:38.9] WJ: Yeah, have fast deploys, have instant rollbacks.

[0:14:41.6] MN: Yeah, be risky when you can that is the idea.

[0:14:45.6] WJ: Yeah, I mean if you are well protected it is okay if you ship a bug. I mean I think Facebook and Google do this where you do phased rollouts. So they will ship things to 1% and then 10% and then 50% of users and if it fails for some small percentage then they'll just roll it back automatically.

[0:15:04.1] MN: Yeah just put it back to zero, that is a great way to actually handle that risk. Rather than one hundo-P of your customers getting this new feature, you can roll it out slowly, and like Facebook has millions, billions of users. 1%, that is a lot of people.
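
The 1%, 10%, 50% progression with an automatic drop back to zero can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Facebook or Google actually implement it: the hashing scheme, the `MAX_ERROR_RATE` threshold, the final 100% stage and all function names are hypothetical.

```python
import hashlib

# 1, 10 and 50 come from the conversation; the final 100 is an assumed last stage.
ROLLOUT_STAGES = [1, 10, 50, 100]

# Hypothetical threshold: roll back when more than 2% of requests in the cohort fail.
MAX_ERROR_RATE = 0.02


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in 0..99 so they stay in the same cohort across stages."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """A user sees the feature when their bucket falls below the current rollout percentage."""
    return bucket(user_id, feature) < percent


def next_percent(current: int, error_rate: float) -> int:
    """Advance to the next stage when healthy; drop back to 0 when the error rate is too high."""
    if error_rate > MAX_ERROR_RATE:
        return 0
    later = [p for p in ROLLOUT_STAGES if p > current]
    return later[0] if later else current


# Usage: a healthy 1% stage advances to 10%; an unhealthy 10% stage rolls back to 0.
print(next_percent(1, error_rate=0.001))
print(next_percent(10, error_rate=0.30))
```

Hashing the user ID rather than picking users at random on each request keeps a given user consistently in or out of the cohort, so anyone who had the feature at 1% still has it at 10%.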

[0:15:22.3] WJ: Yeah, I think it is harder to do if you are a normal scale company because you don’t have enough users to do a phased rollout for.

[0:15:29.7] MN: Yeah I mean you could do 50-50 that is always good, everybody likes 50-50.

[0:15:35.7] WJ: Yeah, I mean you have to have enough people in your initial phase of roll out to be able to tell if things are failing because there is a problem with the code or if it is just a fluke, right? I mean if you only have like 50 active users at a time and you roll out to 10% that is like five people it is like –

[0:15:55.4] MN: They may not even know.

[0:15:56.1] WJ: Is it? I mean maybe they just have like a weird plug-in installed in their browser? It could be anything.

[0:16:02.5] MN: Yeah, I mean for 50 users, yeah you’d have to figure it out.

[0:16:06.2] WJ: Yeah and even if you have 10,000 users how many of them are online when you ship your feature?

[0:16:13.2] MN: Yeah, if you deploy at three in the morning, who is actually on your site and not with the window open asleep? You’d have to track, figure out ways to ensure that those users are active.
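
The five-people problem can be made concrete with a little arithmetic. The sketch below assumes each user independently hits an unrelated error (a weird browser plug-in, say) with some baseline probability; the 5% baseline and the function names are hypothetical, chosen only to illustrate the point.

```python
from math import comb


def cohort_size(active_users: int, percent: float) -> int:
    """Number of active users who receive the new code at a given rollout percentage."""
    return int(active_users * percent / 100)


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(at least k of n users hit an error) if each fails independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 50 active users at a 10% rollout, as in the conversation.
n = cohort_size(50, 10)

# Hypothetical 5% baseline chance that any one user hits an unrelated problem.
baseline = 0.05

print(f"cohort size: {n}")
print(f"chance of at least one failure by pure fluke: {prob_at_least(1, n, baseline):.2f}")
```

With these assumed numbers, a cohort of five sees at least one failure by chance roughly 23% of the time, so a single error report says very little about whether the new code is at fault.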

Well, risk is important. I guess it is if you want to – you need to ensure that you are ahead of your competitors if there are any in the space, you need to ensure that you can get those features out to your users before the competition does. The risk is definitely something that you will have to tolerate if you want to defeat the other organization.

[0:16:43.7] WJ: Yeah, I think the main thing is just making sure that it is discussed and that people are on the same page about it because even if you are at a place that’s very averse to risk, it is impossible to eliminate it altogether. Any change to production code carries a certain amount of risk with it.

So it is important to talk about exactly how bad we’re going to let it get, like, is a half hour of downtime okay? Is a minute of downtime okay? Is a second of downtime okay? Where is the line, so that everybody can take whatever steps are necessary to mitigate the risk to the point where they feel comfortable making changes and meeting that bar.

[0:17:26.6] MN: Right, I mean it gets discussed, I guess, and we have spoken about this before with the SLAs and the SLOs. That conversation definitely comes up, and you can identify the amount of downtime you want or whatnot. The business can even propose these things and the engineers can take it or we can have a further discussion as to what those numbers actually are.

[0:17:48.6] WJ: Right, you don’t want to be in a situation where engineers thought it was okay to break a thing and they broke a thing and it wasn’t down for that long and business is tearing their head off.

[0:17:58.4] MN: Exactly. “Things were down for 83 seconds, why was that?” “But it was just 83 seconds,” right? Like one person may believe 83 seconds is nothing but the business can see that as a huge problem.
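
Whether 83 seconds is "nothing" or "a huge problem" depends on the availability target both sides agreed to. The sketch below converts a target into a downtime budget; the specific targets, the 30-day window and the function name are assumptions for illustration, not numbers from the conversation.

```python
def downtime_budget_seconds(availability_target: float, window_days: int = 30) -> float:
    """Seconds of downtime allowed in the window while still meeting the availability target."""
    return (1.0 - availability_target) * window_days * 24 * 60 * 60


# The outage length from the conversation.
OUTAGE_SECONDS = 83

# Hypothetical availability targets ("three nines" through "five nines").
for target in (0.999, 0.9999, 0.99999):
    budget = downtime_budget_seconds(target)
    verdict = "within" if OUTAGE_SECONDS <= budget else "over"
    print(f"{target:.3%} target: {budget:,.0f}s allowed per 30 days, so 83s is {verdict} budget")
```

Under these assumptions an 83-second outage fits comfortably inside a 99.9% or 99.99% monthly target but blows through a 99.999% one, which is why the number needs to be agreed on before something breaks rather than argued about afterwards.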

But risk is important and you need to ensure that before you get a little risky make sure your risk tolerance is high by ensuring you have the software and the strategies that would support that.

[END OF DISCUSSION]

[0:18:24.8] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five star review and help developers like you find their way into The Rabbit Hole. And never miss an episode, subscribe now however you listen to your favorite podcast.

On behalf of our producer extraordinaire, William Jeffries and my amazing co-host, Dave Anderson and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.

[END]

Links and Resources:

The Rabbit Hole on Twitter