Transcript for Episode 124. Performance Testing
[0:00:01.9] MN: Hello and welcome to The Rabbit Hole, the definitive developer’s podcast in fantabulous Chelsea, Manhattan. I’m your host, Michael Nunez. Our co-host today —
[0:00:09.8] DA: Dave Anderson.
[0:00:10.8] MN: Our producer who is sitting right in front of me.
[0:00:14.1] WJ: William Jeffries.
[0:00:15.1] DA: What? Back from the mysterious east.
[0:00:18.2] MN: Land.
[0:00:20.0] WJ: Yeah, I took The Silk Road, it’s a long journey.
[0:00:23.7] MN: It’s been a long time since we actually recorded in the same room and it’s glad – I’m glad that you’re back.
[0:00:28.9] WJ: Yeah, I mean, it’s crazy because you went on paternity leave and then right before you came back, I went to India for a four-month project.
[0:00:37.6] MN: Yeah.
[0:00:38.9] DA: Hyderabad.
[0:00:39.9] WJ: Yeah. Hide your bad news baby.
[0:00:42.6] MN: Yeah.
[0:00:43.9] DA: How were the laser shows in Hyderabad?
[0:00:45.3] WJ: They’re really into the laser shows.
[0:00:48.5] MN: I didn’t know that I enjoy laser show.
[0:00:51.5] WJ: Yeah, it’s a good time. You know, you just get like a monument or a statue.
[0:00:55.3] MN: Put some lasers on it?
[0:00:58.4] DA: It’s great. Fun for the whole family. It was just a statue before, now it’s like entertainment.
[0:01:03.5] MN: There you go. Today, we’ll be talking about performance testing, why you should do it, and pitfalls to avoid when you start performance testing. I’m sure there is some kind of performance testing when there are laser shows out in Hyderabad, very different to performance testing at the software development level.
[0:01:24.2] DA: It’s not just about the color of the lasers, you know?
[0:01:28.7] MN: Why would a company or why would someone want to performance test?
[0:01:32.4] DA: What even is a performance test?
[0:01:33.7] MN: What is a performance test, what is the purpose? What is performance test?
[0:01:38.3] WJ: That’s a good question. I mean, to me, a performance test is a test that gives you information, gives you data, gives you measurements about the reliability of your app under load. You’re generating some kind of load against your application and then you're measuring how well it performs.
[0:01:59.9] DA: What does that mean in practice? If I have 10,000 users. It will take me five seconds to respond to each request or what are like some of the metrics you might look at.
[0:02:12.8] WJ: I think we’re getting into like the different, that we’re now getting into the different types of performance test, right? Because I mean, if you’re looking to see how it affects latency, when we’re talking about simulating a hundred thousand users or 10,000 users or however many users, it’s like, in order to know how to interpret the results or how to generate the test at all, you kind of need like a goal here.
I mean, what are we doing here? Are we doing capacity planning where we’re saying we think we’re going to 10x our user base? We have 100,000 users a day now, what would happen if it were a million? Then that’s capacity planning, and what you need to do is run a load test where you have a fixed amount of load that you’re going to throw at your app.
If instead, you are trying to do some diagnostics, for example the last time that you had a bunch of traffic, you know, everything broke and you don’t understand why or how, maybe you would want to do like a spike test. Let’s say that you just – you had a huge spike in traffic and one of your services went down and you don’t really know why.
[0:03:19.1] DA: You’re trying to do like a CSI kind of thing, replicate the crime scene and like, pick the trajectory of the bullet that like, you know, took out the service.
[0:03:28.3] WJ: Yeah, maybe you spin up a testing environment which is prod-like and then you replay your production traffic over it, or maybe you create a spike test where you throw a bunch of load at it all at once, you know, with a similar amount of load to what you saw in production when the service went down. That would be like another – or maybe you're just doing preventative measures and you want to regularly be throwing a lot of load at your application in a controlled way.
Maybe during office hours where people are there to fix it, if you’re doing it in prod, or maybe you're doing it in like a lower environment like UAT as part of your deployment pipeline, you know? Just like you're running your unit tests or your end-to-end tests, you might want to run some performance tests to make sure that your metrics are holding up: that you’re returning 200s for valid requests, that you’re not getting timeouts, that your latency isn’t above whatever threshold you care about.
[0:04:24.0] DA: Kind of like introducing like almost like a chaos monkey type approach where you’re trying to like do a bad thing to your system in order to better understand like a real scenario where a bad thing might happen.
[0:04:38.8] WJ: Yeah, I mean, I think we were on a client together where we setup a performance test that ran nightly and we threw a bunch of load at it during off peak times. So that if there was something broken, that we would find out about it when it wasn’t going to be revenue impacting and was easy to fix and then you know, in the morning, when we came in, we could try and address the problem before things got bad.
[0:05:04.8] DA: Right, we want to know when Kim Kardashian’s butt was going to break the internet.
[0:05:10.6] MN: Those things happen. I have a question. You had mentioned threshold in terms of certain performance test. Where do those numbers appear? Do you run a performance test and then get the average of all the performance tests that you’ve run, to figure out what, like for example, our throughput, or load.
Obviously, for load, you want to make sure that things load fast, right? What is fast? How does the company or the engineering team come up with what 'fast' is?
[0:05:45.0] DA: Yeah, what is a failing load or like a failing result for that?
[0:05:50.3] WJ: Yeah, this is a great question because I think this is one of those things that people don’t think about when they first start performance testing and then they realize like, "Okay, I mean, it’s kind of still working, right?"
[0:06:02.0] DA: Still there.
[0:06:03.8] WJ: The page took five seconds to load and for, I don’t know, you know, 7% of the users it never loaded at all. Is that bad? How bad is it? Like, can we deal with that? Can we live with that?
[0:06:13.4] DA: Right. Asking the hard questions, right?
[0:06:17.1] WJ: I mean, what we’re talking about over here really is, I mean, this is the site reliability engineering concept of SLOs. You have your service level indicators like, you know, the amount of time that it takes for a single web request to return, so this is really a measure of latency, and then your service level objective would be something like, it should be under 500 milliseconds.
You know, when we say it should be under 500 milliseconds, we’re going to define some kind of a threshold like – for 95% of users or a P95 or P99, whatever your percentage is. So then, when you go to run your test, you can tell whether it’s passing or failing by checking to see what percentage of users exceeded your 500 millisecond limit. If it’s over 5% or whatever threshold you set, now you’ve exceeded your error budget.
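The pass/fail check described here can be sketched in a few lines of Python. This is only an illustration, not code from the episode: the function names and the nearest-rank percentile method are assumptions, while the 500 ms limit and the 5% budget are the example numbers from the conversation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (pct is 0-100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(latencies_ms, limit_ms=500, allowed_slow_fraction=0.05):
    """Compare measured request latencies against a latency objective.

    The test passes when the share of requests slower than limit_ms
    stays within allowed_slow_fraction (the error budget).
    """
    if not latencies_ms:
        raise ValueError("no latency samples recorded")
    slow = sum(1 for ms in latencies_ms if ms > limit_ms)
    slow_fraction = slow / len(latencies_ms)
    return {
        "p95_ms": percentile(latencies_ms, 95),
        "slow_fraction": slow_fraction,
        "passed": slow_fraction <= allowed_slow_fraction,
    }


# Hypothetical run: 96 fast requests and 4 slow ones stays inside a 5% budget.
report = check_slo([100] * 96 + [900] * 4)
print(report["passed"], report["slow_fraction"])
```

Counting the share of slow requests and checking the P95 against the limit are two views of the same objective; the sketch reports both so the result is easy to discuss with stakeholders.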
[0:07:11.4] DA: Yeah, I guess like some companies, they even have a very specific understanding of like, when the page latency is this, our conversion rate is this, we make this much money. And so there might be like a really – depending on how mature the data collection and analytics of that is, you may have a very real understanding of what the impact of a page taking five seconds to load is.
[0:07:37.1] WJ: Right, I think Amazon did this, they found that if they were able to shave an extra hundred milliseconds off of the average page load time they got an extra 1% in sales revenue which for them is you know, tens of millions of dollars.
[0:07:50.2] DA: Yeah. Like living and dying by like the razor-thin margin, or I guess in Amazon’s case, not dying, just living, or living really large.
[0:08:01.7] WJ: Yeah, I think they jumped like another three spots on the Fortune 500 list, they’re crushing it.
[0:08:08.6] MN: I’m sure that place is riddled with a ton of performance tests in every single sense and every possible thing that can happen to that website, they’re like running it.
[0:08:18.9] DA: Right. I think this is kind of a tangent but I remember reading about the founding of AWS and it was founded like, because they had all this extra capacity that they had for like spikes in traffic during like peak shopping times.
[0:08:35.5] WJ: Christmas, yeah.
[0:08:37.5] DA: All these Christmas servers, nobody’s using that. You know, 11 months out of the year, that’s how you get AWS.
[0:08:46.4] WJ: Now I think AWS generates more revenue than the store.
[0:08:49.8] MN: Than the shop itself.
[0:08:51.8] DA: Christmas, 12 months of the year.
[0:08:55.6] MN: There’s a type of testing that I think you haven’t mentioned. What is scale? When would you performance test for scaling?
[0:09:02.4] WJ: Yeah, well, I mean, a lot of systems have some kind of an autoscaling feature and so we might do some performance testing to see what the characteristics are when you hit your autoscale points. How smooth that transition is, how well it handles, this is sort of a cross over with spike testing. But you might run a spike test and see how well the autoscaling works.
[0:09:23.0] MN: I see. I guess it’s hard to write like a traditional integration or unit test on your [inaudible] that says that you will scale X number of servers at this point. The only way to truly understand the impact of your configuration is to see what happens when you have that situation.
[0:09:44.7] WJ: Right, yeah. Make sure that your autoscaling is working properly and that it is you know, however fast you need it to be because there are different ways to setup your auto scaling, right? You’re going to have different auto scale policies and you can pay more in order to have the autoscaling kick in faster.
Throwing some tests at it, particularly ones that simulate prod-like user traffic, is helpful because then you can see like, okay, how much money could we save or how much better a user experience could we provide if we had faster autoscaling?
[0:10:17.7] DA: Right, yeah.
[0:10:19.4] MN: I have another question though. If we go back, I think we’d done a couple of episodes on testing and we’ve discussed the Martin Fowler triangle pyramid of like testing. Where like the very bottom is your unit test which you shall have a ton of, followed by integration and then acceptance. Where would performance test exist in this triangle, you think?
[0:10:41.2] WJ: That’s a great question. Because I don’t think Martin Fowler has really ever opined on that. I mean –
[0:10:47.0] MN: Yeah. I’m just curious like, do you –
[0:10:48.3] WJ: I would say it’s like at the top, I mean, it would be the one – you would need fewer performance test than you would need acceptance tests, it would be the – they are the most expensive, by a lot.
[0:11:02.5] MN: Yeah, especially the scaling one because you probably have to pay for that server to kick up, it’s like –
[0:11:06.9] WJ: Yeah, all of these, you’re probably going to be wanting to. If you have to throw meaningful load, you know, if you’re using Bees with Machine Guns or any of these paid services like [inaudible] or whatever, your BlazeMeter, you’re going to be paying for a bunch of cloud-hosted servers to throw load at your app.
You have to pay for that and I mean, it’s going to be slow. Because you have to throw enough load at it for the server to have a chance to respond and fall over or do whatever it’s going to do. Especially if you're running a soak test, where you are throwing consistent load at this app or this system for you know, potentially hours or even days. Those are extremely expensive tests.
[0:11:51.6] MN: Okay, a soak test would be a sustained load over some time.
[0:11:56.5] WJ: Yeah.
[0:11:57.0] MN: Whereas a spike is just like a quick punch to the face and then leave it at that.
[0:12:02.1] WJ: Right. Usually the spikes are much more load than a soak test would be but for a much shorter period of time, and soak tests are really useful because they expose these problems that only come about after a server has been straining for a really long time. If you have a memory leak for example. A spike test probably isn’t going to catch the memory leak. Because you run out of traffic pretty quickly.
[0:12:26.2] DA: Yeah, you might get lucky.
[0:12:28.0] WJ: Whereas if you’re throwing consistent load, even if it’s not as much load, for six, eight, 10 hours, you might find that a lot of systems are running out of memory or running out of disk space or whatever, you’ve got some queue that got backed up, there could be all kinds of problems.
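One way a soak test can surface the slow memory growth mentioned above is to sample memory use during the run and look at the trend. The sketch below is a hypothetical illustration: the function names, the sample format of (hours elapsed, memory in MB), and the 5 MB per hour threshold are all assumptions, and a real run would take its samples from whatever monitoring is already in place.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope of (hours_elapsed, memory_mb) samples.

    Needs at least two samples taken at different times.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    return covariance / variance


def looks_like_leak(samples, max_growth_mb_per_hour=5.0):
    """Flag a run whose memory keeps climbing faster than the threshold."""
    return memory_growth_per_hour(samples) > max_growth_mb_per_hour


# Hypothetical eight-hour soak: memory climbs 20 MB every hour.
leaky_run = [(hour, 200 + 20 * hour) for hour in range(9)]
print(looks_like_leak(leaky_run))
```

A short spike test would only produce a handful of samples close together in time, which is why the trend only becomes visible over hours of sustained load.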
[0:12:47.3] DA: That makes sense.
[0:12:48.5] MN: Say I’m sold on performance testing my application. You mentioned a tool that sounded pretty cool and I would be terrified if it actually existed, you mentioned Bees with Machine Guns. What a great –
[0:13:01.5] WJ: Great name for a tool.
[0:13:03.4] MN: I do not want Bees with Machine Guns. I’m already terrified of bees as it is.
[0:13:08.3] WJ: Bees are already scary, yeah. Don’t arm them.
[0:13:11.4] DA: The bees through all the Bronx, right?
[0:13:14.8] MN: They’re different, they’ve got the New York caps and Timberland boots. So what is Bees with Machine Guns?
[0:13:20.5] WJ: Yeah, that’s an open source tool that will spin up AWS EC2 instances for you and then you can point them at your server and it throws a bunch of load. It’s like an open source way of doing the same kind of stuff that tools like BlazeMeter would do for you.
[0:13:34.8] DA: Okay, which is like a paid tool? Just more broadly I guess, I’m like, "Okay, yeah, I get it, I need to understand my system under load." Like how do you go about implementing it? Do you need to use a tool like this? Should I roll my own?
[0:13:52.4] MN: Should I tell my mom to visit the website?
[0:13:56.1] WJ: Honestly like —
[0:13:56.1] DA: Go to fiverr.com.
[0:13:58.7] WJ: That is a totally valid performance test: get you and like two other devs to all hit the website at the same time from your laptops.
[0:14:06.5] DA: Hopefully you have more than one worker on your production, but maybe you don’t and it could just be hosed.
[0:14:12.3] WJ: I’ve worked on projects where you get four engineers all hitting that same endpoint that hasn’t been optimized yet and it will fall over, and it is nice to do this kind of manual testing just to confirm that your test suite is returning accurate results. But I think that probably a good starting point would be to use a local command line tool. I think probably the most popular one is JMeter, which actually comes with a CLI.
It is pretty user friendly and then you can throw some load at it just using your laptop without having to go to AWS or some kind of third party service and then once you have a general sense of what the performance characteristics of your app are and you want to do something more elaborate like throw more load than one laptop could generate or hook it up to a Jenkins job or something more automated, then you can move onto some of the more robust tools.
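To get a feel for what "throwing some load at it from your laptop" involves before reaching for JMeter, here is a minimal Python sketch using only the standard library. It is not how JMeter works internally; `run_load` and its parameters are hypothetical names, and the request itself is passed in as a function so the sketch can be tried without a real server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, total_requests, concurrency):
    """Call send_request() total_requests times from `concurrency` threads.

    Returns one (ok, latency_ms) tuple per request; a request counts as
    failed when send_request raises an exception.
    """
    def timed_call(_):
        start = time.perf_counter()
        try:
            send_request()
            ok = True
        except Exception:
            ok = False
        return ok, (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


# Stand-in for a real request so the example runs anywhere.
results = run_load(lambda: time.sleep(0.01), total_requests=20, concurrency=4)
failures = sum(1 for ok, _ in results if not ok)
print(len(results), "requests,", failures, "failures")
```

Against a real endpoint, `send_request` could wrap `urllib.request.urlopen(url, timeout=5).read()`, which raises on HTTP error statuses and timeouts; the URL would be whatever test environment you are allowed to hit. A single laptop running a loop like this tops out quickly, which is the point at which the hosted tools discussed next become useful.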
[0:15:10.4] MN: And you mentioned a couple of the paid ones, one being BlazeMeter.
[0:15:15.5] WJ: Yeah, BlazeMeter is a paid cloud-hosted version of JMeter. So, I mean, it is handy if you have been working locally with your JMeter files because with JMeter, you create these XML files, which are a configuration. And then you can upload it to BlazeMeter and BlazeMeter allows you to horizontally scale and they produce like graphs, charts, in their web interface, which is very handy.
[0:15:39.2] DA: So you may hit a wall with your local testing or the resources that you have available and very easily –
[0:15:45.4] WJ: Yeah if you have a non-terrible app then you are probably not going to be able to generate enough traffic with your local laptop.
[0:15:52.7] MN: What are some common pitfalls that you have seen when people or engineering teams start to implement performance testing?
[0:15:58.8] WJ: I mean we already talked about one of them, which is not really having any expectations for what constitutes failure. I mean you really do need to define what success looks like and what sort of threshold you’re willing to tolerate and that should be a conversation you have with customers. Like with customers and with product, "What exactly do we really need to maintain?" because nobody has 100% uptime. Nobody has zero latency, like these things are impossible. Even Google has downtime.
[0:16:27.2] MN: Oh yeah like Slack. When Slack is down like everyone knows and everyone starts bugging out and, “Why can’t I message people? I want to see my cat GIF’s.” GitHub is another one that is really important to individuals and –
[0:16:40.5] WJ: And Amazon, when AWS goes down the whole internet shuts down.
[0:16:44.4] MN: Yeah like 50% is probably on AWS.
[0:16:47.5] WJ: Do you guys remember what it was like 2016 when they went down?
[0:16:49.9] DA: Yeah S3 went down and what was it like, the red light indicator on the status page was stored in S3? So it just displayed a green light indicator because it was cached, but it was down.
[0:17:05.6] WJ: Amazing.
[0:17:06.4] MN: That’s great, oh man. I think hitting the cache on your performance test is probably another issue that may arise. Because you want to test your application when it retrieves data from the beginning and not something that’s already cached.
[0:17:22.3] DA: Right, especially if you are using a service like Fastly or something for caching, like maybe you want to know that it is working but then you don’t want to invest in the soak test on Fastly or something. You’d essentially be like spending money to verify that Fastly is fast at a certain point.
It's like, “Yeah, that is pretty fast,” good job. Yeah, I guess you would also go wrong if you weren’t using prod-like data. Like if your data set is very small and you had like a non-performant query. You know, if there are ten rows then maybe it is fine, like you may have an okay response time, but then as the data starts to scale out you could see some more issues when you have more load.
[0:18:10.8] WJ: Yeah.
[0:18:11.6] MN: Would you call it a best practice to do your performance test on production?
[0:18:16.6] WJ: Yeah, I would say so. I think a lot of people are afraid to do that.
[0:18:20.7] DA: Oh yeah, I was going to say that I feel terrified that you just said that.
[0:18:23.8] MN: I just got anxiety right now. On prod, like with the possibility that it will go down?
[0:18:29.6] WJ: Yeah absolutely. I mean, so you should have a failsafe mechanism where you can shut – where the performance test will automatically shut down if prod starts to fall over. But I mean absolutely, you should be testing your prod environment because I mean wouldn’t you rather know that your prod environment can’t handle load when you are running a performance test instead of when you have an actual user spike?
[0:18:53.4] DA: Right and there are so many things like in a modern app that can be configured completely differently like between environments. Like you could be using [inaudible] in a lower environment but Amazon SQS on a higher environment. So like the performance characteristics of those may be completely different. The configuration of those will also be completely different. So you want to know.
[0:19:15.9] MN: So the only person that could bring down production is me. So I know exactly when it happens and I think that is a good point because if we could control when it goes down then we can know why it went down.
[0:19:26.9] WJ: Yeah, I mean I think if you just figure out how much your maximum historical load has ever been and you just throw double that at it every morning before people get in for work or while engineers are in the office or maybe late at night whenever non-peak traffic is and you have a failsafe mechanism where if the performance test suite starts to register a whole bunch of 500’s or timeouts, that it shuts off the test then this is like a really wonderful preventative measure that can help you with your capacity planning later on.
And it is tricky because when you are performance testing in production, you also have production traffic. So I mean it is harder to know exactly how much load you’re throwing at the system at any given time. But if your system is robust then it should be able to handle an order of magnitude more traffic than you have. So you should be safe.
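The failsafe idea, stopping the test once it starts registering a lot of 500s or timeouts, can be sketched as a small guard around the request loop. Everything named here is an assumption made for illustration: the `Failsafe` class, the 5% error-rate limit, the window of the last 100 requests, and the minimum of 20 samples before it is allowed to trip.

```python
from collections import deque


class Failsafe:
    """Trips when the error rate over recent requests exceeds a limit."""

    def __init__(self, max_error_rate=0.05, window=100, min_samples=20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.recent = deque(maxlen=window)  # True = success, False = failure

    def record(self, ok):
        self.recent.append(ok)

    def tripped(self):
        # Avoid shutting down on the first unlucky request.
        if len(self.recent) < self.min_samples:
            return False
        errors = sum(1 for ok in self.recent if not ok)
        return errors / len(self.recent) > self.max_error_rate


def run_until_tripped(send_request, max_requests, failsafe):
    """Send requests one by one, stopping early if the failsafe trips.

    Returns the number of requests actually sent.
    """
    sent = 0
    for _ in range(max_requests):
        try:
            send_request()
            failsafe.record(True)
        except Exception:  # treat any error (HTTP 500, timeout) as a failure
            failsafe.record(False)
        sent += 1
        if failsafe.tripped():
            break
    return sent
```

In a production run the same check would also want to watch the health of real user traffic, not just the test's own requests, since the two share the system.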
[0:20:20.6] DA: What are some alternatives like if I really just can’t afford hitting my production environment with this kind of traffic or like –
[0:20:29.1] WJ: I think having a dedicated performance testing environment is a great thing to have regardless of whether or not you’re going to test in production. But it is a great place to do automated testing as part of your deployment pipeline. Because when you deploy code, I mean you don’t want to run a performance test every time you deploy code and you know have that run in production.
And you probably wouldn’t want to fail a code deployment with something as non-deterministic as a production performance test. But having that run in a lower environment as part of your deployment pipeline just like you run automated user acceptance tests in a lower environment as part of your pipeline, I think is really valuable.
[0:21:08.6] DA: Interesting.
[0:21:09.1] MN: Right, so if it is not production itself, something really, really, really close to it and then –
[0:21:13.6] WJ: Yeah, [inaudible] environment if you don’t want to set up another dedicated environment for performance testing, but yeah, as close to prod as possible.
[0:21:20.3] DA: Earlier you mentioned doing diagnostics to understand an issue or a failure by replaying production traffic. How would I even start collecting what my production traffic looks like, or how?
[0:21:35.5] WJ: Yeah, so this is tricky because this is a constant challenge. It is making sure that your tests are prod-like, that your performance tests are similar to prod-like traffic. So you can record the actual web requests that get made like to your application and then you can replay those exact requests. But you can run into issues with this when you’re dealing with sensitive information.
Because I mean you’re going to want to clean that data if there is any sensitive user information. You may not want that stored and you know in your test suite, which is probably going to be committed code and accessible code like that data is going to be accessible to developers. So I mean you’re probably going to have to make some compromises about how prod like your traffic is.
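Cleaning recorded requests before they land in a test suite might look something like the sketch below. The recorded-request shape and the list of sensitive field names are hypothetical; a real list would come from a review of what your application actually handles.

```python
# Hypothetical field names; a real list needs a proper privacy review.
SENSITIVE_KEYS = {"password", "email", "ssn", "authorization", "cookie"}


def scrub(value):
    """Return a copy of a recorded request with sensitive fields redacted.

    Walks nested dicts and lists (keys are assumed to be strings, as in
    JSON) and leaves the original recording untouched.
    """
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


recorded = {
    "method": "POST",
    "path": "/login",
    "headers": {"Authorization": "Bearer abc123", "Accept": "application/json"},
    "body": {"email": "user@example.com", "remember_me": True},
}
print(scrub(recorded))
```

Redacting values keeps the request shape intact, but a replayed login with a redacted password will no longer succeed, which is one of the compromises on prod-likeness mentioned above.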
[0:22:18.5] DA: Isn’t it going to be tricky too with like posts that may change the state of the application? Like to truly replicate a situation, you’d have a starting state on the database and the cache and everywhere else. Some requests that change the state and then results that depend on that. It seems like it is going to be a little tricky to figure out how to do that.
[0:22:42.9] WJ: Right, yeah absolutely. If a user is deleting a blog post and then you go and replay that traffic, but that blog post doesn’t exist anymore, you might register a failure even though everything is fine. So I mean it becomes really hard. The whole prod-like traffic thing I think is a holy grail and I don't think I have ever seen anybody have truly prod-like traffic replayed unless it is a very simple service.
[0:23:06.1] DA: Right, like a CMS kind of thing like you’re just getting results, getting pages.
[0:23:10.7] WJ: And also the size of the data that comes through. I mean that is one of the things that I think you really do want to make sure you get right is, if users are consistently uploading ten megabyte files, your performance test shouldn’t be uploading 10 kilobyte files.
They should be comparable in terms of the volume like the throughput. So I mean that’s a throughput test. That is like another kind of category of testing where you try and use the same volume of data or push the limits of the volume of data.
[0:23:44.3] DA: Yeah, totally. That makes sense.
[0:23:46.3] MN: Yeah, it seems like we touched on a whole lot of different types of performance testing. I’ll ask, out of the ones we discussed, what performance test is the most important one that people should start with?
[0:24:01.0] WJ: I would say probably stress testing. That is what I always recommend that people start with, because normally people don’t really know the performance characteristics of their app, like how much traffic could it really handle, and so that can be kind of anxiety provoking. Just not knowing how much traffic your app can handle, and so if you run a stress test, what you’re doing is you are throwing enough – you continue to throw more and more load at your app until eventually it does break. You stress it until the breaking point.
I think the term actually comes from mechanical engineering, when you would stress like a tool or a bridge or a piece of equipment to the point that it actually physically breaks, and what is nice about it is that now you know. Well, we can handle 100 users, and we have 10 right now. I guess we have a little bit of time but –
[0:24:48.7] DA: Yeah don’t let that 101st person on that bridge.
[0:24:50.6] WJ: The sales team is going to [inaudible] us real quick, yeah.
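The stress test described above, adding load step by step until the app breaks, comes down to a simple ramp loop. In this sketch the load generation is hidden behind a `measure_error_rate` function that you would supply; that name, the step sizes, and the 5% failure threshold are all assumptions for illustration.

```python
def find_breaking_point(measure_error_rate, start_rps=10, step_rps=10,
                        max_rps=10_000, max_error_rate=0.05):
    """Ramp up the request rate until the error rate crosses the threshold.

    measure_error_rate(rps) should run load at `rps` requests per second
    and return the fraction of failed requests. Returns a tuple of
    (last rate that held up, first rate that broke); either can be None.
    """
    last_good = None
    rps = start_rps
    while rps <= max_rps:
        if measure_error_rate(rps) > max_error_rate:
            return last_good, rps
        last_good = rps
        rps += step_rps
    return last_good, None  # never broke within the range we tried


# Simulated system that starts failing above 100 requests per second.
def simulated_system(rps):
    return 0.0 if rps <= 100 else 0.5


print(find_breaking_point(simulated_system))
```

The returned pair gives the baseline for the capacity conversation: the highest rate known to work and the lowest rate known to fail.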
[0:24:53.8] MN: So if you are starting to performance test, be sure to dip into that stress test first and then –
[0:25:01.5] WJ: Yeah, people are very quick to jump to load testing and they’re like, “We are going to throw exactly 500 requests a second at it.” As though that were – people always have some reason why they think that number is the appropriate number.
[0:25:16.1] MN: Yeah, which goes back to what you mentioned before that should be discussed with the customer and product people as to what that number is and come up with the proper performance test not this arbitrary number that comes up and –
[0:25:27.6] WJ: Right, I think it is very easy to use it to justify whatever position you already have. Like if you want to invest in more infrastructure or if you want to argue that everything is fine, like load tests are a great way to confirm your existing bias. It is like, “Oh yeah, you know we know we’re not going to have more than 200 requests a second and so we did a load test and it wasn’t that bad. So it is probably fine, right?”
[0:25:54.8] MN: Things are passing, everything is good.
[0:25:56.7] WJ: Right, yeah I mean they are not totally passing but I think whatever the results were is probably good enough, right?
[0:26:03.5] DA: Yeah, I guess like stress testing too, it gives you a baseline and so you can say like, “Okay, what is 100% load? What is the point at which we will have a [inaudible]?” and then we can see what the characteristics of the system are under a load of 90% or 80% or whatever.
[0:26:20.4] WJ: Yeah and it forces that conversation with stakeholders about what is acceptable. Because I mean even at low load, you are going to have some small percentage of errors. So, you know, having that conversation is saying, “Okay, we have decided that anything over five seconds of page load and anything over five percent error rate, or whatever the numbers are, that’s unacceptable and we are willing to sacrifice feature delivery work in order to invest in improvements to the infrastructure in order to fix this. Like we are going to stop the engineering team, we are going to stop the presses.”
[0:26:58.8] MN: Right, you need to fix the infrastructure, make sure all of these numbers are passing.
[0:27:02.5] WJ: Right.
[0:27:03.0] MN: Yes, so if you are not running performance tests, you should do them. Talk to your customers and collaborate with the organization to determine what is passing and ensure you have performance tests that cover those numbers. If you are starting to performance test, start with the stress test.
[0:27:21.2] WJ: Yeah and your customers might tell you that they want 100% uptime and zero latency and, you know, you might need to renegotiate, this is like –
[0:27:34.6] MN: That’s hard.
[0:27:36.0] WJ: Maybe talk to business about it, come up with some kind of a compromise.
[END OF INTERVIEW]
[0:27:40.8] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going.
Like what you hear? Give us a five star review and help developers like you find their way into The Rabbit Hole and never miss an episode, subscribe now however you listen to your favorite podcast.
On behalf of our producer extraordinaire, William Jeffries and my amazing co-host, Dave Anderson and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.
[END]
Links and Resources: