201. Metrics as Incentives

Many managers assess developer performance as if they were runners, where how many lines of code you write determines how good you are. Instead, developers are like baseball players, where a suite of metrics is needed to measure performance. In today’s episode, we unpack how metrics are used to judge coding performance and how metric incentives can create less than desirable coding behaviors. After chatting about how vaccine eligibility metrics can incentivize binge eating, we chat about the ins-and-outs of using test coverage to measure project health. We then discuss how managers are weighing their team’s efforts in the age of remote work. While reflecting on when it can be useful to look at how many lines of code someone has written, we explore how incentivizing individual performance can damage a team project. Later, we share some of our successes with AB testing. Using metrics as incentives can lead to both positive and negative unintended consequences. Join us to hear about how these knock-on effects can impact your work.

Key Points From This Episode:

How your BMI affects your COVID vaccine eligibility.
We discuss what test coverage numbers are acceptable for a project to continue.
Setting limits on how much your coverage can deviate from the master build.
How teams can be incentivized to hit a code coverage number and not to catch bugs.
Exploring the limits of test coverage.
How managers measure developer productivity.
Why developers are like baseball players, not runners.
When it can be useful to look at how many lines of code someone has written.
The dangers of incentivizing individual performance over team success.
Using AB testing to determine product direction.

Transcript for Episode 201. Metrics as Incentives

[0:00:01.9] MN: Hello and welcome to The Rabbit Hole, the definitive developer’s podcast. Living large in New York. I’m your host, Michael Nunez. Our co-host today.

[0:00:09.1] DA: Dave Anderson.

[0:00:10.3] MN: And our producer.

[0:00:11.3] WJ: William Jeffries.

[0:00:12.8] MN: Today, we’ll be talking about metrics as incentives.

[0:00:17.0] DA: When they can go horribly wrong.

[0:00:20.0] MN: Or horribly right, I don’t know, if that’s even a thing.

[0:00:22.6] DA: In either case, maybe horrible but —

[0:00:28.3] MN: Yeah. As software engineers, we’re surrounded by data. And often that data can turn into different sources of information that then your product project manager, VP of engineering, CTO, use that information and then goes on with it. But we’ll use other different points and pieces of metrics throughout our conversation today.

[0:00:49.1] DA: Yeah. A personal metric that has recently popped up in my life as an incentive is that, in the State of New Jersey, the eligibility for the COVID vaccine is open to a number of medical conditions that are complications for the disease. Which is great, equitable vaccine access for people who need it is wonderful. One of those in New Jersey to some controversy is the fact that your BMI is tied to your eligibility.

If you're overweight, with a BMI of 25, which is a weird metric as it is, we don’t have to get into the definition of what BMI is, I don’t know what it is, William probably knows what it is. Maybe he’ll tell us. I’m not BMI 25, I’m 24. I thought about it and I’m like, “Wait, I haven’t weighed myself since like six months.” I’ve made so many bad decisions of what I’ve eaten in my life and I just had Wendy’s chicken sandwich for lunch and not exercising.

So I step on the scale and I measure it. I’m 25, I broke that thing. I’m like, is this real?

[0:02:05.1] MN: Is it happening, is it like the Willy Wonka golden ticket? You thought you now have the ability to get the vaccine. You’re 25 bro.

[0:02:13.9] DA: I was like, I don’t think I scientifically weighed myself, I had just eaten a huge amount of food so I don’t think it fully counted but now I’m like, “I’m pretty close, maybe.” This is like a great tie-in for the fast-food chains, you know?

[0:02:31.8] MN: Exactly. Wendy’s got you the vaccine bro.

[0:02:35.1] DA: Just eat your way to the front of the line like Pac-Man.

[0:02:39.0] MN: Exactly. I mean, I think yeah, you mentioned before that New York is — the number is at 30.

[0:02:43.3] WJ: That’s too high of a bar man.

[0:02:46.6] MN: Challenge accepted bro.

[0:02:48.2] DA: No buddy, don’t do it.

[0:02:51.3] MN: I’m going to go get a ton of cheese. Of like canned cheese, nacho cheese, just eat it with my hands every day.

[0:03:00.0] DA: No. A little harsh though.

[0:03:00.9] MN: Clogged arteries and everything, it’s happening. I need this vaccine. I mean, everybody’s been locked at home, being not so kind to one’s self. I imagine that this is probably Jersey’s way of saying, “We’re sorry we kept you locked in, get the vaccine a little earlier.”

[0:03:19.8] DA: Right, yeah.

[0:03:20.3] MN: But that New York 30 — I mean, I don’t know, I’m no scientist but I’m sure there are specific reasons. I mean if you're overweight then you’re more at risk of COVID doing damage, which is why they want you to go and get the vaccine if you’re overweight. I might have to eat a jar of mayonnaise and then hopefully I gain all the weight.

[0:03:38.7] DA: Right, like if that metric is changing your behavior, which I feel like it kind of did this week because although, I’m weighing myself later and I’m not, I’m like, “You know what? I’m going to get that burrito for lunch.”

I mean, I already made up my mind, I’m not doing that, I want to change my behavior and exercise but I’m like, “If I eat a burrito then —”

[0:03:58.9] MN: — I can get a vaccine.

[0:04:02.0] DA: Yeah but it’s awful. Morally gray.

[0:04:04.9] MN: As you mentioned, it can influence fast-food to encourage people to eat it so they get just above that number necessary in their state to get the vaccine.

[0:04:14.9] DA: Yeah. And in our day-to-day lives as developers, sometimes test coverage is that burrito that you eat to get yourself above the line.

[0:04:24.2] MN: Yes. You stay at your desk to make sure the test coverage goes above that number necessary for your team.

[0:04:31.2] DA: Then you can’t move. You’re so bloated with test coverage that you can’t refactor anymore. You’ve poured it with cement.

[0:04:38.8] MN: What’s the magic number? Because if people say often, you got to go for that 100P right? 100%, but is that really possible? Are you testing every file in your code base? I mean, is that really possible? I don’t know. I think I’ll ask everyone in the room, in the Zoom room, if you will. What were some test coverage set-up on the clients that you’ve been on that worked well with y’all? Was the number below, anything below 80 is not worth it. Stop the build, do you all have a number in mind?

[0:05:12.1] WJ: I’ve seen build break at 70% or at 80% or at 90%. I’ve never seen anybody try to break a build at under 100%. I think I see a lot of places that do not break the build at all, no matter how well the test coverage is.

[0:05:24.3] DA: Yes, I’ve seen some places where they start with the very low percentage and as an incentive to increase the test coverage, they will fail your PR if the test coverage goes down. If the diff of coverage between master and your branch is lower then they’re like, “We got a problem here.”

[0:05:42.9] MN: Yeah. I was going to say, that’s the one that I’ve seen that worked very well for me, it’s like it incentivized testing. Maybe it was like a codebase that was a little bit older that they introduced testing so they want to increase it. It was like, anything under 2% of what is different from master will stop the build.

I thought it’s like a very great way to ensure that you continue to add coverage to your codebase without this number, this 70%, 80 or 90 just like, don’t go under the two bro, that we’ve already had. It doesn’t matter what number you start, it’s as long as what you are merging into master.
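The "don't drop too far below master" rule described here can be sketched in a few lines. This is only an illustration, not the setup any of the hosts actually used: the function name `coverage_gate`, its parameters, and the reading of "2%" as a drop of two percentage points relative to master are all assumptions.

```python
def coverage_gate(base_pct: float, branch_pct: float, max_drop: float = 2.0) -> bool:
    """Return True if the branch may merge, False if the build should stop.

    base_pct   -- coverage percentage on master (hypothetical input)
    branch_pct -- coverage percentage on the branch being merged
    max_drop   -- allowed drop in percentage points (assumed to be 2)
    """
    drop = base_pct - branch_pct  # positive when the branch lowers coverage
    return drop <= max_drop


# The starting number does not matter, only the difference from master.
print(coverage_gate(41.0, 40.0))  # small drop, within tolerance
print(coverage_gate(41.0, 38.5))  # drop of 2.5 points, over tolerance
print(coverage_gate(41.0, 45.0))  # coverage went up
```

The point of the design is that an older codebase with low coverage can adopt the check on day one, because nothing is compared against an absolute target like 70% or 80%.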

[0:06:19.0] DA: Yeah, I feel like when you reach a certain point, you probably want to turn that off because you don’t want to keep going through the roof. Because you might end up with tests that are low value and cover every fiddly boolean case that will make it more challenging to refactor down the line.

[0:06:39.1] WJ: I’m fine with people wanting to test edge cases. My concern is always when people are optimizing to hit a code coverage number and not to write tests that are actually going to catch bugs. That’s the thing that I see where the real issue is if somebody passes in like a null or some other data type or you know, it’s like a function that accepts many different data types. And they test the one that’s really easy to test and will hit all the lines of code and not the data types that are more likely to cause problems.

[0:07:17.5] MN: Right, because that incentivized the developer to write tests for the sake of the increase of the test coverage number but not the quality of the code. And ensuring that things are working properly.

[0:07:28.3] DA: Right, like being thoughtful about the purpose of the test.

[0:07:31.0] WJ: Yeah, the test coverage number just tells you how many lines were executed. What percentage of the lines in the file got executed? That’s a pretty ham-handed way of looking at whether or not the test would break if you introduced a bug.
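A small illustration of that gap between "lines executed" and "bugs caught", under made-up names: `total_length` and `test_easy_case` are hypothetical and not from any project discussed in the episode. The single easy test below executes every line of the function, so a line-coverage tool would report the function as fully covered, yet a `None` in the input still crashes it.

```python
def total_length(items):
    """Sum the lengths of all items (hypothetical example function)."""
    total = 0
    for item in items:
        total += len(item)  # raises TypeError if item is None
    return total


def test_easy_case():
    # Executes every line of total_length, so line coverage is complete.
    assert total_length(["ab", "cde"]) == 5


test_easy_case()

# The input that is more likely to cause problems was never tested.
try:
    total_length(["ab", None])
except TypeError as err:
    print("uncaught by the easy test:", err)
```

Coverage answers "did this line run?", not "would a test fail if this line were wrong?", which is why a test chosen for the number can miss the data types that actually break.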

[0:07:50.6] MN: Right, make sure you’re writing those tests for the right reasons, not just for those numbers — for it to go up, right? You want to write tests for the sake of putting code and being able to refactor safely. That is the most difficult thing to measure, that particular part. We leave it up to the engineers to make sure that they’re doing the right thing.

The testing, not for the numbers, for them gains. But testing for the code.

[0:08:13.8] DA: Have you ever had a manager that tried to measure your productivity as a developer?

[0:08:21.4] WJ: Yes.

[0:08:21.2] MN: Yeah, definitely. That’s always a really, really cheap way to kind of know how much work someone is doing. Looking at GitHub to see how many commits you made or how many additions or removals of the code you have. And I find that just like, micro-manage-y but in the digital age, if that makes sense. It’s not like just checking up on you on task but more on looking at what you are coding or producing.

[0:08:47.0] DA: Right. I guess we talked with some folks earlier about remote work in this kind of age where there’s an increased interest from some type of management and software that measures your engagement with different tasks. And can give them a report when you’re on the internet or something like that. That’s like a separate, I guess, concern entirely in its own Orwellian sense.

Is measuring commits a good measure of productivity or lines of code? What are you really measuring with that?

[0:09:24.7] MN: Well, I think it’s more about their interaction with the codebase, it’s not always one to one like Bobby who introduced more lines of code is probably the person who is doing more of the work, right? You could probably get a yarn.lock file. If you're like the first person to introduce that bad boy and like that adds so many lines of code. I don’t think that that would be the metric that you can throw it off with those kind of commits that happen.

I don’t know if commits or lines of code can do that. This is a question for you all. If you squash commits, does that actually reduce the amount of commits you’ve made in a report or does it still count? I imagine that it wouldn’t because it’s squashed, it’s not in the history no more.
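The yarn.lock point can be made concrete with a rough sketch. The data shape (tuples of author, path, lines added), the names, and the list of generated files are all assumptions for illustration; a real report would pull these numbers from version control history.

```python
from collections import defaultdict

# Files assumed to be machine-generated; adjust for the project at hand.
GENERATED_FILES = {"yarn.lock", "package-lock.json"}


def lines_added_by_author(changes, ignore=GENERATED_FILES):
    """Total lines added per author, skipping files named in `ignore`."""
    totals = defaultdict(int)
    for author, path, added in changes:
        filename = path.rsplit("/", 1)[-1]  # last path component
        if filename in ignore:
            continue
        totals[author] += added
    return dict(totals)


# Hypothetical history: Bobby was the first person to commit the lock file.
changes = [
    ("bobby", "yarn.lock", 9000),
    ("bobby", "src/app.js", 40),
    ("alice", "src/cart.js", 300),
]

print(lines_added_by_author(changes, ignore=set()))  # raw count
print(lines_added_by_author(changes))                # generated files skipped
```

Even with generated files filtered out, the result is still only one stat among many and says nothing about the value of the lines.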

[0:10:04.1] WJ: Yeah, I think that that would count against you. I had a manager who said, he felt like there are two ways of measuring performance. There’s the way you would measure a runner and the way that you would measure a baseball player. And managers, particularly who come from the business side and are used to sales, they try and measure you like a runner. And they’re like, “Well, we measure our sales people based on how many sales they did. Let’s measure our coders by how many lines of code that they write.” That is very simplistic and leads to just some terrible incentives like, “Let’s just commit a bunch of assets to version control. Then in that way, we’ll get credit for a lot of lines in code.”

It’s like very easy to break. There’s baseball players, right? Baseball players, you can’t just measure with a single metric. So a baseball card is going to have a bunch of stats on it. His view was that engineers were like baseball players.

If you looked at any one metric, it would throw you off but if you looked at a whole suite of metrics, then it could give you a sense of whether this is a good player or a bad player. It may not tell you if this is a world champion but it will, at least, get you in the ballpark.

[0:11:11.6] MN: Pun intended.

[0:11:12.0] DA: Ha-ha. I guess there’s like other variations on this too where I feel like the lines of code and commits thing is something that is like, I mean hopefully more theoretical. I know I’ve done it kind of as a joke or you know, just being like, “Okay, who is the top dog here?” I feel like measuring story points and stories can be like kind of a proxy for this as well. This same idea which maybe is another one of the metrics on the back of your baseball card, your developer baseball card — in a weird way.

[0:11:49.5] MN: Yeah, but the thing is that also, even if we looked at, say, the metrics of how many story-points a person has done, who’s to say that that person was assigned all those stories? Because that person knows that codebase very well and doesn’t pair with anyone because he’s a cowboy coder or likes to make sure that those features get shipped out. Do we know the kind of work that he’s doing? Is he doing like real hot fixes and adding another if statement to the list of if statements? You don’t really get that with story points either, it becomes like that baseball card where it’s like, “Oh Bobby has the commits and story points plus the test coverage” and all of that kind of bit.

[0:12:26.8] DA: I worked with a developer too that — they were pretty great. They knew the codebase very well, they were very productive. But they also had a skill that they were great at arbitrage of like looking at the stories that their team had pointed. And knowing which stories their colleagues had collectively overestimated. They would go for those stories, pick them up, crush it, and then be like, “You know, I shipped a good chunk of stories.” And then they, you know, like looked really good.

[0:13:01.1] MN: Oh wow, yeah that’s pretty clever. Like for that person to know the codebase and then know enough to get the sweet, sweet story point advantage. I do want to say, William, thank you for that analogy because now I can compare myself to a baseball player. I’ve never thought I’d see the day.

[0:13:18.1] DA: It’s a dream come true. I got to hit a home run with this one.

[0:13:20.4] MN: Hopefully we do. I think what I had done in the past though, me personally, in terms of looking at commits and lines of code, I’ve used that number to determine a particular client developer’s influence on the codebase and how well they know the codebase. That number was just like — I used it but it wasn’t like I’m expecting or reducing someone’s pay or like punishing them for the work that they did or did not do. It was just like, I didn’t have any science behind it; “Oh but you know Bobby’s done a lot of commits here. Let me see if I can talk to him about this particular piece of the code.”

And that leads me to some good findings. I mean I don’t have any scientific data to back it up but I always thought like, “Oh someone has a lot of lines of code at the codebase, they might know a thing or two.” Whether that was really true or not. It’s different.

[0:14:09.1] WJ: Big old grain of salt but I mean yeah, you know, it could be a clue for your investigation. I mean, I know if I’m looking at an open-source repo and it’s not clear who the maintainer is, sometimes I will go to that pulse tab and you can see it. They will show you who put the most lines of code in. Like the first number of commits over the longest period of time and you can see usually most open-source repos, there is like one person who is very clearly at the top of that and then a very long tail of other contributors.

[0:14:39.0] DA: Yeah, I guess as a manager too if you are trying to measure productivity, there is something that’s a little bit more intangible about — you might want to get at from all of these things like story points and lines of code, commits. You really want to know what is the impact of the change that they’re doing, what is the value that they are driving to the users. And how much engagement the features are getting and how useful are they. And how few bugs. That is a little bit harder to measure.

[0:15:07.1] MN: Even more abstract, like, how well is that person as a team player with the rest of the team. Like, that doesn’t get captured in the code either. But you want to make sure that Bobby is a good developer and is willing to help out when people are stuck on a problem. It is always a pleasant day when you get to pair with Bobby, for example, versus the amount of lines of code that a person does that could be thrashing all over the place and making sure that his work gets done before another person, that kind of stuff.

[0:15:34.3] WJ: I mean if you’re a team lead, you are probably, you’re hopefully writing fewer lines of code. And spending more time investing in your teammates. That can be sort of a false indicator. You see somebody whose metric shows that they have delivered very few story points or that they have spent a lot of time on just one story, whatever. It could be that actually they are a leader, maybe not even one that is formally recognized. But who is contributing so much in the code base or to the team by accelerating all of their teammates.

[0:16:01.7] DA: Right, by like working through them and lifting them up and having like outsized impact. Yeah, a thing that I have seen too is like on that thread of incentivizing individual performance versus team performance. People’s bonus compensation, if they have individual bonuses with really specific terms. That may incentivize them to act against the best interest of the team. Or what might lift up their teammates in favor of being the rock star hero that gets things over the line.

[0:16:35.4] MN: I need money, so if it is individual, baby, I’m going for it, I guess is the idea, right? What’s the incentive of you sharing the bonus if it’s individualized, I guess is the question? Not that I would really do that but that’s the thought that I had in mind. If it’s a team bonus then you are more likely to work with the team together to get the most useful features out and customer interaction. Hopefully that, I’m not sure how to capture those metrics to determine what teams get what bonus.

I mean I am sure it’s like customer satisfaction ratings and that kind of stuff. I’d be curious to hear any organizations that are doing team-based bonuses and what does that look like, what does that structure look like. And how is it set-up because it’s really an interesting thing. I’ve only heard of the idea being individualized, which is — can lead to like the story point arbitrage and lines of code committed and that kind of stuff.

[0:17:30.4] DA: Yeah, I guess we shifted some different awards or ceremonies at Stride from being individual-based to team-based and that’s been kind of cool. Instead of one person getting a meal, everybody gets a meal.

[0:17:41.7] MN: Yeah, exactly. Everyone — the whole team gets a meal for doing great work. With them ‘Stridies,’ yeah.

[0:17:47.1] DA: Got to get up that BMI.

[0:17:48.7] MN: Yeah, you got to get that BMI bro, order two pork chops. That’s what you got to do bro. You get a greasy sloppy Joe and cheeseburger on Stride. You get that BMI up, you get the vaccine, that’s my plan.

[0:18:02.1] WJ: Peanut butter and jelly on cheeseburger, that would do it.

[0:18:04.5] MN: Oh man, don’t even remind me. I need to go hit up that Burger Time in the Bronx, bro. They definitely got that burger on deck, oh my gosh.

[0:18:11.9] DA: — South Korean special.

[0:18:13.1] MN: Yeah, the South Korea special too, yeah. Well, you got a burger choice that does the peanut butter and jam, no?

[0:18:18.6] WJ: Yeah. Wait, they do that in the Bronx?

[0:18:21.3] MN: There is a burger called — I forget what it’s called but there is a burger place in the Bronx called Burger Time. And they have all sorts of wacky burgers. They have the peanut butter and bacon burger. They also have my personal favorite, it’s called the Excaliburger, where it’s a cheeseburger but rather than burger buns, it’s on two grilled cheeses. Guys, I know how to get my BMI up, I’m going to do it.

[0:18:45.1] DA: I don’t want your child to be an orphan.

[0:18:49.6] MN: The Excaliburger, check it out, it’s great. The last thing that I want to talk about really quick is a thing that we often do and as engineers is AB testing. And using that to our advantage to determine which way or direction we want to release this feature by exposing our users with the A-product or the B-product. And then figure out what’s the best. Do you all have any success stories with AB testing?

[0:19:15.7] DA: I, like, once I had a super successful AB test that I was very proud of because I defeated the designer. They really wanted a sticky button in a modal at the bottom of the screen. It was not working on different devices. I was like, “Well look, we’ll just put it on the top of the screen and then problem solved.” It’s not sticky, or, you know, just sticky at the top and then it’s not a problem. We did an AB test and then it really performed very differently than they expected. And even I expected that they were right. I didn’t think that was right but.

[0:19:50.6] MN: Defeated the designer, bro that’s great.

[0:19:55.4] DA: Yeah.

[0:19:56.0] WJ: Did you tell the designer that it was totally broken on some devices? Because I feel like maybe you were stacking your numbers.

[0:20:04.0] DA: Oh well, yeah. They knew. They knew. But they were like, they just wanted it that way and they’re like, “Oh, I don’t care. It’s not going to make that big of a difference.” And then I was like, “Look at this, people can’t add the food to the cart. They can’t buy anything so they just leave.”

[0:20:21.9] MN: Oh boy, no that will do it. Oh man, no. I imagine you use software to determine that people were not clicking the button when it was down below and unable to be seen versus when it was up top and ready to go. And I am sure those metrics definitely help out at the end of the day.

[0:20:37.6] WJ: Bounce rate, this is an important metric.

[0:20:39.9] MN: You want to make sure that people are going through the entire cycle in your product and AB testing is the best way to — or a great way to do that.

[0:20:47.5] DA: There are certain things that you might not want to AB test. You can go kind of overboard if you’re like addicted to AB tests.

[0:20:54.5] WJ: AB tests are expensive. Use them on things where — well, first of all, you need to have two cases that are worth testing against each other. And then you want to use that on things that are actually going to matter.

[0:21:05.6] MN: Don’t go AB testing everything, is that what you’re saying? Just like, all over the place or AB test and then in the A, you do another AB?

[0:21:15.0] WJ: It’s the whole alphabet, just put all the letters in there. I think, first of all, you need enough users to see the interaction to be able to make some kind of a conclusion. Because if you throw out an AB test on like some super niche feature, you could be waiting weeks to even get to like 30 plus users actually interacting with it. I mean if it is relating to check-out, if it’s relating to like something that is very directly going to impact sales. If you have a thing that’s going to reduce abandoned shopping carts then okay, great, AB test that. But you don’t need to AB test every PR.
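To see why a handful of users is not enough to draw a conclusion, here is a rough sketch of a two-proportion z-test using only the standard library. The function name and the sample numbers are invented for illustration, and the normal approximation it relies on is itself shaky at very small sample sizes, which is part of the point.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    conv_a, conv_b -- number of users who converted in each variant
    n_a, n_b       -- number of users who saw each variant
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Same 40% vs 60% conversion rates, very different sample sizes.
small = two_proportion_p_value(6, 15, 9, 15)          # 30 users in total
large = two_proportion_p_value(600, 1500, 900, 1500)  # 3000 users in total
print(small > 0.05, large < 0.05)
```

With 30 users the gap between the variants is indistinguishable from noise at the usual 0.05 threshold, while the same rates across thousands of users are not, which is why a niche feature can take weeks to produce a usable result.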

[0:21:49.6] DA: Thank God. There are other ways to validate your assumptions besides AB testing like actually measuring the outcomes over time. Or having some qualitative interview with the user. Like — “Is this sec or is it good?” And be like, “It’s good.” Okay, good.

[0:22:08.1] WJ: Maybe managers should try that with their managees.

[0:22:12.1] MN: Bullet point —

[0:22:14.2] DA: Is this guy terrible or is he okay?

[0:22:17.9] MN: I am going to AB test getting the peanut butter and jam burger and see how much BMI to get there. And then I am going to AB test that with the Excaliburger and see which one gives me more BMI growth. And then the one that does, I’m just going to buy a whole lot more of those until I hit the 30. And then, you know, I’ll get the vaccine. Yes, it’s going to happen. I can also wait until May 1st or I could enjoy all the burgers I want.

That’s day — I don’t know man, this is a hard button to press, which one. I don’t know, I think I might have to go with the burgers.

[0:22:51.6] DA: I feel like we need to revisit this topic with a focus on making good choices with good metrics. I feel like we talked about some pretty gnarly ones.

[0:22:59.5] MN: A hundred episodes later. Hello, welcome to The Rabbit Hole. I have type-two diabetes from eating too many cheeseburgers.

[0:23:07.2] DA: Uh-oh.

[0:23:07.8] MN: We’ll see.

[0:23:08.5] WJ: Let’s be real, you just wanted those cheeseburgers. You weren’t going to find a metric to justify that anyway.

[0:23:12.9] MN: Exactly. You know it and y'all know me too well. Y'all know me too well.

[END OF INTERVIEW]

[0:23:20.5] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five star review and help developers like you find their way into The Rabbit Hole and never miss an episode, subscribe now however you listen to your favorite podcast. On behalf of our producer extraordinaire, William Jeffries and my amazing co-host, Dave Anderson and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.

[END]

Links and Resources:

The Rabbit Hole on Twitter

Stride

Michael Nunez on LinkedIn

Michael Nunez on Twitter

David Anderson on LinkedIn

David Anderson on Twitter

William Jeffries on LinkedIn

William Jeffries on Twitter

Wendy’s

Burger Time