117. Logging

Welcome back to the Rabbit Hole, the definitive developer's podcast, live from the boogie down Bronx. Today we'll be talking about logging: what it is, its uses, and how to structure your log messages so that they make sense when you're looking through your logs. Logging is an important piece of functionality, and something that we all run into time and again. To start off, we dive into the difference between centralized and distributed logs and the benefits of each. We also discuss the most important rules to follow when logging, what not to include, and which events are most critical to capture. Next, we debate how long to store logs, the importance of log rotation, and the most popular tools for logging. So don’t miss out on this informative episode as we find our way into the rabbit hole!

Key Points From This Episode:

  • Defining logging and why it is an important discussion.
  • The difference between centralized versus distributed logs.
  • Two main use cases for logging: debugging and input for a monitoring system.
  • When it is most sensible to log and capture events.
  • Importance of logging with context, and being specific.
  • Why it is important not to include sensitive information in logs.
  • The “who, what, when, where, and result” rule for logging.
  • Understanding the need to periodically delete old logs.
  • The standard time for storing logs.
  • Determining when logs might not be the right tool for the job.
  • Our favorite logging tools.
  • And much more!

Transcript for Episode 117. Logging

[0:00:01.9] MN: Hello and welcome to the Rabbit Hole, the definitive developer's podcast, live from the boogie down Bronx. I'm your host, Michael Nunez. Our co-host today,

[0:00:09.9] DA: Dave Anderson in Sunny North Carolina.

[0:00:12.9] MN: And our producer,

[0:00:13.9] WJ: William Jeffries in Hyderabad, India.

[0:00:16.9] MN: Today we'll be talking about logging: the uses of it, what it is, and how to structure your log messages so that they make sense when you're looking through your logs.

How are you guys? How's it going? How's everything?

[0:00:31.9] WJ: Life is good. It's like 8 p.m. over here. I don't know what time it is over there.

[0:00:35.9] MN: It is 10:30 a.m. in the Bronx.

[0:00:38.9] WJ: Wow.

[0:00:39.9] DA: Yeah, same in North Carolina.

[0:00:42.9] MN: Everything is fine. Not as sunny as I imagine it is in North Carolina though. That's so good. Do you guys want to just — let's jump right into logging really quick. Logging is an important piece of functionality that exists. I run into it all the time, and I imagine the both of you and everyone listening do as well. Let's talk about centralized versus distributed logs. Can someone define the two types of logs, and why is a centralized or distributed log good?

[0:01:14.9] DA: Or I mean, like, maybe we could take a step back and just talk about logging in general? So I mean, like, the basic point of logging that you might do in like a really simple system is just do like a print statement. Just print out to standard out. But a lot of programming languages have their own library specifically for logging, like Log4j in Java. Very similar to Log4j, there’s just a standard logging library in Python that you can just import and configure, and you know, that's important for, like, production [inaudible] the print statement in a better way. So you can actually, like, have rotation of the log file if it gets too big, or configure if it's going to standard out or it's going to a file. And then, you know, if you're putting it to a file and you have a bunch of servers, then that kind of comes to the point where you're thinking about, like, centralizing the logs or just distributing logs. Like, do you wanna have all these files on a bunch of different servers? So if there's some issue, then you have to, like, go into each box and be like, “Did this one have a problem?” You know, [inaudible] in the log. “Okay, no, this one's good, let me go to the other one.”
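The configuration DA describes maps directly onto Python's standard logging module. A minimal sketch; the logger name, file name, and size limits here are illustrative, not from the episode:

```python
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical logger name; any dotted name works.
logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# Instead of print(): send messages to a file, and rotate it once it
# reaches ~1 MB, keeping at most 5 old copies so it can't grow forever.
handler = RotatingFileHandler("app.log", maxBytes=1_000_000, backupCount=5)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)
logger.addHandler(handler)

logger.info("server started")
```

Swapping `RotatingFileHandler` for `logging.StreamHandler()` sends the same messages to standard out instead, which is the other configuration choice mentioned above.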

[0:02:27.9] MN: I imagine that's more time consuming though, right?

[0:02:29.9] DA: Yeah, exactly.

[0:02:30.9] WJ: If you horizontally scaled your app servers across 10 different machines and you start having errors, it could be that one of the instances of your app server is down and the other nine are fine. So having to go SSH into each machine and check the logs locally is pretty painful. Whereas if you have a centralized logging tool like [inaudible], or what have you, then you can search all of them at once.

[0:02:59.9] MN: Right. So, like, if the ninth box has an error, you don't have to search through each one and then find out on the ninth one that there's an issue. With centralized logging, you can search and see exactly which one had the error and then, you know, SSH into that box to figure out why that happened.

[0:03:19.9] WJ: Yeah. I mean, it seems like there are like two big use cases that I can think of for logging. The first one's debugging, which is, I think, the main thing that we've been talking about, where you want to know generally what your app is doing so that when it's not working properly, you can investigate why. And then the other one is as input for a monitoring system, where you want to know how many orders are getting placed. Let me log every time an order happens and then search on that and see, "Oh, well, there were 100 log statements saying that an order happened in the past hour. So there must have been 100 orders this past hour."

[0:03:59.9] DA: Right. Yeah, when you're talking about, like, having a specific message that says, “Okay, an order is placed,” that implies that you've, like, thought about designing the message ahead of time, so that you have a really reliable indication that some order was created. So, like, one challenge I always find is actually making sure that there is a good, useful message there that, like, has some information that lets me know specifically what happened.

Sometimes it's enough to just say the order was created, or an order was created. But then you don't know which order was created, who created it, all those kinds of, like, meta detail information may not be in the message. So you might want to think about what kind of things are important that you might need when that order is created, and when you're, like, looking for the information in the log.

[0:05:02.9] WJ: Right and it's like, is there a better way to get the same information? Like placing orders is kind of a weird example, because probably that’s in your database. You could query your database and it would be way better than querying the logs.

[0:05:14.9] DA: Right. And that's really the system of record, is like, what data made it into the database? Because a log could totally lie to you. It could say, like, “An order is created,” but you know, it's actually being logged one line before it’s being [inaudible] to the database.

[0:05:31.9] MN: And you may have had, like, a rollback or something that actually didn't save into the database. Is there, like, logging design? Like, when thinking about logs, as you mentioned before, right? You want — you can't just say, “An order was created.” You could say something like, “An order was created for member ID,” and then, you know, drop the member ID. Or, “Created the order ID in the log.” Is that, like, safe to do?

[0:05:56.9] WJ: Yeah. I mean, I think that, like, things that have to do with your application state, like the whole application, make sense to log. Like, when your app boots or when it shuts down, those are important things to capture. And then more generically, any time that you're making a database transaction, probably logging the SQL, both the query and the result, is sensible. And then you can also infer database transaction times from that. Maybe just logging all of the HTTP requests that come through, so you know when requests are being made and what responses are getting returned, doing some introspection on those packets to see the payload, to see status codes.

[0:06:45.9] DA: Yeah, totally. I was just, like, curious about this, and so I just pulled up an article. Love the title here. It’s called The 10 Commandments of Logging.

[0:06:56.9] MN: Oh!

[0:06:57.9] DA: This is, like, number six on the list: “Thou shalt log with context.” Like, getting the information about the specific transaction or situation that's at hand, instead of just saying, you know, “This failed.” Like, what specific thing failed and what was inside of it?

[0:07:19.9] MN: Yeah, that makes sense. “Thou shalt log with context.” That's a good one. Yeah, I think, like, one thing you don't want in the logs, and I think this may have happened before in time, is just, like, logging plain-text usernames or passwords when someone successfully logs in. It's not something that you really want in the logs, because we shouldn't have that sensitive information printed out in plain text in a log somewhere. That's just weird. But definitely having context when you're actually printing logs, pretty much, is a good rule to follow.
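One way to guard against this in Python is a logging filter that masks anything that looks like a credential before it is written. This is only a sketch: real redaction needs more patterns than a single keyword, and this version only inspects the rendered message string:

```python
import logging

class RedactSensitive(logging.Filter):
    """Mask log messages that appear to contain credentials."""
    def filter(self, record):
        msg = record.getMessage()  # render msg % args before inspecting
        if "password" in msg.lower():
            record.msg = "[REDACTED: message contained sensitive data]"
            record.args = None
        return True  # keep the record, just with the masked text

auth_log = logging.getLogger("auth")
auth_log.addFilter(RedactSensitive())
```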

[0:07:52.9] WJ: I think “who, what, when, where, and result” is a good rule. So, like, that username, like, whatever command was run, some kind of timestamp so that you could do a time series graph or just know if it was recent. The where, meaning what server it came from, or what database, or what page in your app, and then the result, meaning, like, was there an exception that got thrown? Did it return a normal payload? If your payloads are a reasonable size, what was in the payload. I think if you return really large payloads, that’s probably not a thing you want to pollute your logs with, unless you are willing to pay a lot of money for log storage.
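The "who, what, when, where, and result" rule can be made concrete as a small helper that emits one structured line per event. Everything here (the helper's name, the field names, the JSON encoding) is an illustrative sketch, not a standard:

```python
import json
import logging
import socket
from datetime import datetime, timezone

def log_event(logger, who, what, result, **extra):
    """Log one event covering who, what, when, where, and result."""
    entry = {
        "who": who,                                      # e.g. the username
        "what": what,                                    # the command that was run
        "when": datetime.now(timezone.utc).isoformat(),  # timestamp, for time-series graphs
        "where": socket.gethostname(),                   # which server it came from
        "result": result,                                # e.g. "ok" or the exception name
    }
    entry.update(extra)  # small payloads can ride along; skip huge ones
    logger.info(json.dumps(entry))

audit = logging.getLogger("audit")
```

Emitting one JSON object per line also makes the logs easy for a centralized tool to parse and filter later.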

[0:08:33.9] MN: Oh, yeah.

[0:08:34.9] DA: Right. Which, like, yeah, is kind of an internal issue. I was working on an application where, you know, at one point we were confused because our deployment started failing, and we noticed that the reason why our deployments were failing is because we didn't have any room for the Docker images. And when we looked into why we didn't have room for the Docker images, it was because we just had all of the logs in some, you know, random directory on the server. Like, whenever it was set up, you know, years before any of us were there, and all of the logs forever had been stored there.

[0:09:14.9] MN: Oh no,

[0:09:15.9] WJ: Enter log rotation.

[0:09:18.9] MN: Yeah.

[0:09:19.9] DA: Right. Yeah, exactly.

[0:09:21.9] WJ: Very important tool, periodically deleting old logs.
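In Python, rotation with automatic deletion of old logs is built into the standard library's `TimedRotatingFileHandler`: roll the file over at midnight and keep only a fixed number of old copies. The file name and the 30-day retention are illustrative choices:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Roll "service.log" over at midnight every day; backupCount=30 means
# anything older than 30 days is deleted automatically.
handler = TimedRotatingFileHandler("service.log", when="midnight", backupCount=30)
logger = logging.getLogger("service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("log rotation configured")
```

On servers, the same job is often done outside the application by the `logrotate` utility instead.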

[0:09:24.9] DA: Yeah, exactly.

[0:09:26.9] MN: How long do you store logs for? Like what would be the standard?

[0:09:30.9] WJ: How rich are you?

[0:09:32.9] MN: Have you guys seen, like, magnitudes where they're kept for months or years? Because, like, if you think about it, I think, like, the Board of Education is required to hold your transcript, your high school transcript, forever. Which is, like, really crazy.

[0:09:52.9] WJ: That’s insane.

[0:09:53.9] MN: Yeah, that's really crazy. So, like, imagine — and that's like physical, like space. Do you guys know how long logs are usually kept before you have to be insanely rich?

[0:10:02.9] WJ: Can you give me a copy of your high school transcript, please?

[0:10:05.9] DA: I mean, like, I feel like that kind of makes sense, because that's like the final artifact. That's like — that’s the record that's written to the database, or what have you, like the final transcript. Whereas, like, the log messages of your education are, like, you know, all of the homework and notebooks and, you know, finger paintings that you did in art class. Like, if the Department of Education was responsible for hanging on to all of those messages and not just the final product, that would be like a pretty ridiculous burden. I've gone through, like, boxes of that stuff myself, like from college, and whatever.

I finally threw all of that away because it didn't mean anything to me anymore. Like, I was looking at my notes, and I just literally couldn't understand the math that was in them, which I think is a good metaphor for the logging situation too. Because, like, with a log message, there's a time and a place where, you know, the log message has the most value and will be really helpful in helping you understand what the outcome of the system was, what the behavior of the system was at that time.

[0:11:15.9] MN: Right, and those logs need to constantly be rotating in a space where the context of time exists. So, like, for example, you may have had, you know, those boxes of stuff from a long, long time ago. It didn't make any sense. But, you know, a month after those things were created, it would have made more sense to keep them around.

[0:11:38.9] DA: Right. I would go through that and, like, try and understand, like, you know, what errors I had. Like, what questions I got wrong on homework or a quiz, in order to, like, fix the bugs in my brain and, like, get a better grade on the test.

[0:11:53.9] WJ: I think that, like, what I've seen is startups that don't really have any compliance and don't really have a lot of money will keep logs for however long they can keep them locally without paying any extra money, and that's kind of one extreme. And then the other extreme would be like in the banking industry, where people are super paranoid about regulation, and they just keep everything for seven years because there's some law somewhere about that being the most ridiculous amount of time someone could require you to keep a thing, and they're just really paranoid. And then most people are probably somewhere in between.

[0:12:31.9] DA: Yeah, that kind of makes sense. But I guess, like, when you're configuring your production system, that's probably the place where you're most concerned with log retention. You probably don't really care too much about, like, any of your test environments to the same degree.

[0:12:46.9] WJ: I think for most people, it's very unlikely you would need more than one year's worth of logs.

[0:12:52.9] MN: Yeah. That sounds about right.

[0:12:53.9] DA: Yeah. Because there could be, like, a persistent issue that you didn't realize happened in production. Maybe, like, there's some data corruption that was happening slowly over time.

[0:13:04.9] WJ: Like, maybe you would want to generate some kind of special report over the course of a full year, to see, like, how your application performed over the course of a year. Unless you have some real external requirement placed on you for storage, it's unlikely that you would ever need to look back further than that. I mean, if you have a production issue that you need to debug, probably it's happening right now. Or you just fixed it and you’re like, “Wow! How do I make sure that never happens again?”

[0:13:37.9] MN: And then once it's fixed, the new — the logs will exist in time, and hopefully you don't have that error happen again. So I guess, have people ever died from too much logging? Like, is there a cap on the amount of logs you actually write? Like, are you going to write every single entry that's happening in your application? Like, what's a good — or when does it become noise, I guess, is the question.

[0:14:05.9] DA: I mean, like, one thing that you might want to consider is what level you're bucketing the messages into. So if there is, like, something that's kind of more noisy, you could bucket that into, like, a debug-level log instead of a higher level like info or, like, a warning or error, and then in production you can disable that and just have it for developers locally.
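A sketch of that bucketing in Python; the logger name and the environment-variable check are made-up examples:

```python
import logging
import os

log = logging.getLogger("checkout")

# In production, raise the threshold to INFO so debug noise is dropped
# without touching any call sites; locally, keep DEBUG visible.
log.setLevel(logging.DEBUG if os.getenv("APP_ENV") == "dev" else logging.INFO)

log.debug("cart serialized to 1832 bytes")   # noisy, developer-facing detail
log.info("order placed")                     # business event worth keeping
log.warning("payment gateway retry needed")  # a real problem
```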

[0:14:28.9] WJ: My thinking is, like, by default everyone should be logging any kind of exceptions that get thrown and not caught, any SQL transactions, any HTTP transactions, and any application default activities. Like, if you're running a Django app or a Spring app or whatever, there's going to be some stuff that the framework itself provides. All of that you should capture automatically, and then any logging that requires actually instrumenting your code, I would recommend waiting until you need it. And then when you're like, “Oh man, I think I could have debugged this production issue better if I had this particular log,” then you go back and add that log.

Because when you get into this, like, guessing game, where it's like, “You know, this would probably be a useful thing to know,” then that's kind of a dangerous path where you can sort of quickly end up with a lot of log statements in your code and then a lot of output in your log files that’s just noise.
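The first item on that default list, uncaught exceptions, can be captured in Python by installing a `sys.excepthook` that writes the traceback to the log before the process exits. A minimal sketch:

```python
import logging
import sys

log = logging.getLogger("app")

def log_uncaught(exc_type, exc_value, exc_tb):
    """Write any exception that escaped every handler to the log."""
    log.error("uncaught exception", exc_info=(exc_type, exc_value, exc_tb))
    # Still print the traceback the way Python normally would.
    sys.__excepthook__(exc_type, exc_value, exc_tb)

sys.excepthook = log_uncaught
```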

[0:15:28.9] MN: Yeah.

[0:15:29.9] DA: Sounds like you're talking about [inaudible].

[0:15:31.9] WJ: A favorite acronym on this podcast.

[0:15:35.9] MN: Oh, yeah.

[0:15:37.9] DA: Just for life, really. You know, in general, there's a lot of things I'm not going to need. So why am I trying to get it? But yeah, that totally makes sense. If you put off writing more of those log statements until later, then you'll have a better idea about, you know, what kind of things may be important or relevant to log in those situations.

Although, that being said, you also don't want to go completely in the other direction and not log anything at all, because when there is an issue, you won’t have anything to go on at all. So I like the idea of starting with, like, the boundaries of the system. Like you were saying, with the database writes and HTTP requests and things like that. Like, what’s going on?

[0:16:23.9] WJ: Yeah, and then if there's something in particular that you want to be able to monitor, like if you want to be able to graph something that happens. Like, I don't know, if you want to know how often people are abandoning carts, you could have a log that shows any time anybody gets far enough in your application to create a cart, to add something to a cart, and then also log any time anybody purchases all the items in the cart. And then you can start to do a diff between those two and see how many carts are getting abandoned, and then graph that. That could be a useful metric if you want to drive that down and have fewer abandoned carts.

[0:17:01.9] DA: Yeah. Although then, like, you may also consider what other tools are available for something like that. Like, is that something that's more appropriate for something more analytics-driven, like Google Analytics, where you're going to have goals and stages and see what your conversion rates are? Like, maybe a log isn't the right tool for that kind of job.

[0:17:31.9] WJ: Yeah, that's a good point. Google Analytics is kind of a really comprehensive client-side logging tool that's already designed to instrument whatever metrics you might want.

[0:17:43.9] DA: And there are other tools too that you might consider for logging more application-level metrics and numbers. Like StatsD, or Datadog, or Prometheus, where you can, like, collect specific timing metrics and things like that in a more discrete and meaningful way.

[0:18:01.9] WJ: Does anybody have any favorite logging tools?

[0:18:03.9] MN: I recently used Sumo Logic. That one is pretty fun. You could just, like, search what host you expect these errors to be in. You can use regular expressions to find, like, certain messages, and it was — it’s a pretty standard logging system. But that's the one I'm using right now.

[0:18:21.9] DA: Yeah, I've used Splunk before. It's pretty powerful.

[0:18:28.9] MN: In Splunk, you can use, like, key-value pairs, right? Like, to find things?

[0:18:33.9] DA: Yeah, you can, like, actually parse out values from the log and filter on those things. Even in the message itself, you could, like, parse it out and pull out more information that might be relevant to you. Pretty cool. But you do have to be like, pretty intentional about, like how you design your log messages to begin with in order to get good use out of it.

[0:18:58.9] WJ: I think Splunk is my favorite logging tool if someone else is paying for it.

[0:19:05.9] MN: Is it really expensive?

[0:19:06.9] WJ: Oh, my God, yeah.

[0:19:08.9] MN: Oh, man.

[0:19:09.9] WJ: It seems like ELK and Splunk are the two big players and which one you pick depends on how much money you have.

[0:19:15.9] DA: Right. And what’s ELK again? That's Elasticsearch?

[0:19:21.9] WJ: Logstash and Kibana.

[0:19:24.9] DA: So it's like all different pieces that kind of work together to make your logs more usable.

[0:19:29.9] WJ: I think they're calling it the Elastic Stack now, because they added Filebeat to it. They added Beats, and there are a whole bunch of extra products that that company has created that sort of integrate with the stack, and it kind of ruined the acronym.

[0:19:50.9] DA: ELK

[0:19:51.9] MN: ELKB. All those tools sound like they're able to search through and serve as, like, a centralized entry point for the logs, in order to find any errors or information that you want from your logs. Like, much, much faster than SSH’ing in and doing a grep. Like, that seems like a ton more work.

[0:20:13.9] DA: Oh, yeah. Yeah, definitely. And also like doing some basic graphing and maybe even some monitoring.

[END OF INTERVIEW]

[0:20:21.1] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five-star review and help developers just like you find their way into The Rabbit Hole and never miss an episode. Subscribe now however you listen to your favorite podcast. On behalf of our producer extraordinaire, William Jeffries, and my amazing co-host, Dave Anderson, and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.

Links and Resources:

The Rabbit Hole on Twitter

Log4j

StatsD

Datadog

Prometheus

Sumo Logic

Splunk

ELK