Logging vs. instrumentation (bourgon.org)
72 points by grey-area on Feb 8, 2016 | 43 comments


I think, like most people, I will disagree with this:

> In my opinion, my thesis from GopherCon 2014 still holds: services should only log actionable information.

This assumes that I have future knowledge of which logged events will be actionable.

I used to work in the classifieds department of a newspaper. The rule of thumb was that the cover of the newspaper was worthless by 11am, the classifieds by 2pm.

But that's not quite true. Every single edition of a newspaper retains a small probability of turning out to be very important at some unknown future time. Hence the practice of archiving.

So it is with logs. In my day job I have on several occasions turned to raw logs in order to reconstruct the history of unexpected events.

What metric, to take an actual example, tells me that a user was trying to evade export control by hopping to a VPN halfway through a session? Which gauge tells me that an error was caused by the interaction of third party javascript and an obscure web browser?

For myself, I draw the boundary differently. Logging is for events with identity, events that will be inspected individually. Metrics are for events that will never have identity and will only be understood statistically.
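
To make that boundary concrete, here is a minimal Go sketch (the counter name and log fields are invented for illustration, not taken from the article): the expvar counter is only meaningful in aggregate, while the log line carries identity you can grep for later.

    package main

    import (
        "errors"
        "expvar"
        "log"
    )

    // A metric has no identity; it is only meaningful in aggregate.
    var loginFailures = expvar.NewInt("login_failures_total")

    // A log entry has identity; it will be grepped for and read individually.
    func handleLogin(requestID, userID string, err error) {
        if err != nil {
            loginFailures.Add(1)
            log.Printf("request_id=%s user_id=%s event=login_failed err=%q",
                requestID, userID, err)
        }
    }

    func main() {
        handleLogin("req-42", "user-7", errors.New("bad password"))
    }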


IMO the author buried the lede:

> Finally, understand that logging is expensive. I’ve seen entire teams of absolutely brilliant engineers spend years building, managing, and evolving logging infrastructure. It’s a hard problem, made much harder by overburdening the pipelines with unnecessary load.

It's ultimately a piece about engineering cost versus benefit. There's a real cost associated with logging every event with identity, to create an audit trail. Maybe that's necessary in your domain, for compliance reasons or otherwise. In which case it falls into the bucket of machine-read structured logging, and you design for it. But for lots of domains, it isn't strictly necessary.


In which light, allow me to rephrase the argument:

Log as much as you can afford.

Given that the cost of logging is low and continuously falling, the default decision should be "log everything".

"Log only actionable messages" is analogous to advice like "always use bit masks instead of booleans", or only "use two digits to encode years". It assumes algo-economics that are no longer the normal case.


IMO it's also worth considering the issue of synchronous vs. asynchronous logging within the process.

Asynchronous logging improves throughput, but latency is a tougher nut to crack.
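
For illustration, one common asynchronous approach is to hand log lines to a buffered channel and do the actual I/O from a single writer goroutine. A rough Go sketch, with drop-on-full being just one possible policy:

    package main

    import (
        "fmt"
        "os"
    )

    // asyncLogger decouples callers from the write path: callers pay only for a
    // channel send, and a single goroutine does the actual I/O.
    type asyncLogger struct {
        lines chan string
        done  chan struct{}
    }

    func newAsyncLogger(buf int) *asyncLogger {
        l := &asyncLogger{lines: make(chan string, buf), done: make(chan struct{})}
        go func() {
            for line := range l.lines {
                fmt.Fprintln(os.Stderr, line) // the slow, synchronous part
            }
            close(l.done)
        }()
        return l
    }

    func (l *asyncLogger) Log(line string) {
        select {
        case l.lines <- line: // fast path: buffered send
        default: // buffer full: drop (or block, depending on what you can afford)
        }
    }

    func (l *asyncLogger) Close() {
        close(l.lines)
        <-l.done // wait for the writer goroutine to drain the buffer
    }

    func main() {
        logger := newAsyncLogger(1024)
        logger.Log("request handled in 3ms")
        logger.Close()
    }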


> Given that the cost of logging is low and continuously falling

Everything gets cheaper over time, but the cost of logging is actually quite high.


I don't know why you're getting downvoted, because in my mind you're right in some circumstances. More than once, I've found logging-related performance issues (i.e. logging showing up near the top of the profile in a performance optimization task).

One immediate example that comes to mind was code that had a bunch of logging statements like log.debug(boost::format("%d %x etc") % param % param). Normally the code would run with debug logging turned off, but there's a pretty brutal problem there: the formatted string is always built, only for the logger to immediately discard it. It turns out one of these formatted strings took a fair bit of processing to generate and was hurting the system.


That's a solved problem. Your logger should be capable of formatting on its own, and take the arguments to the format string as variadic arguments. This is how Java's ubiquitous SLF4J interface is defined: if the log level isn't low enough, the arguments are tossed but never composited.
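
Go has no SLF4J, but the same idea translates: hand the logger the format string and the raw arguments, and let it decide whether to composite them. A hedged sketch (the leveled logger here is invented for illustration):

    package main

    import (
        "fmt"
        "os"
    )

    type level int

    const (
        DEBUG level = iota
        INFO
    )

    type logger struct{ min level }

    // Debugf receives the format string and raw args; if DEBUG is disabled,
    // the arguments are tossed and fmt.Fprintf never runs.
    func (l *logger) Debugf(format string, args ...interface{}) {
        if DEBUG < l.min {
            return
        }
        fmt.Fprintf(os.Stderr, format+"\n", args...)
    }

    func expensiveDump() string { return "... lots of formatting work ..." }

    // lazyDump defers the expensive work until the logger actually formats it.
    type lazyDump struct{}

    func (lazyDump) String() string { return expensiveDump() }

    func main() {
        log := &logger{min: INFO} // debug logging turned off

        // Anti-pattern: the string is built eagerly, even though DEBUG is off.
        log.Debugf("%s", expensiveDump())

        // Better: pass the pieces and let the logger decide whether to format.
        log.Debugf("state: %v", lazyDump{})
    }

Note that the variadic call still evaluates and boxes its arguments at the call site; that residual cost is what the macro-based approach mentioned below avoids entirely.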


And it's even better with languages with macros, like Rust, Scala, even C++. There won't be any function call to begin with if the corresponding logging level is not enabled, only a branch.


I replaced it with a macro!


The cost of logging is actually quite low relative to the enormous cost of engineering hours spent debugging a system that's flying blind.


I don't know who downvoted you, though I still disagree.

If I was writing an embedded system, or a system with soft-to-hard realtime requirements, or one where flat straight line performance was the only requirement that counted, sure, I might ditch or constrain logging.

These systems might be common by manufactured volume. But most programmers will never touch one.

You're probably thinking of I/O costs. Again, these aren't as brutal as they used to be -- SSDs swallow bursty writes with aplomb, and when those aren't enough, log-structured filesystems are good at smoothing out writes.

As for myself, in the work I do in my day job, I print out to STDOUT/STDERR and let the platform wick those lines away to a central firehose.


   > If I was writing an embedded system, or a system 
   > with soft-to-hard realtime requirements, or one 
   > where flat straight line performance was the only 
   > requirement that counted, sure, I might ditch or
   > constrain logging.
I've worked on embedded systems, and logging is still extremely important, if not more important, because of the difficulty of seeing what's going on (it's hard to reproduce events that only happen in a truck while you're inside your office).

In one case, a team had been trying for months to debug a problem, but when proper logging was added, the issue was easily fixed.


> in the work I do in my day job, I print out to STDOUT/STDERR and let the platform wick those lines away to a central firehose.

I think this is the correct approach, but (to a first approximation) every organization I've worked at that does this has hit saturation limits of "the platform" in extremely short order. Especially if you're doing audit-trail-style logging, in my experience this becomes the primary bottleneck of your infrastructure very quickly.


Did you come up with particular tactics? Any insights on platform bottlenecks?


I am curious about this. Can you give specifics about the costs of logging in this context?


Not the parent, but I know our production web servers have very limited logging: they were already highly optimized and handle a huge amount of traffic, and logging was taking up something like 15% of all CPU time (and since they were CPU-bound, 15% of cost), so we have stats like those described in this article instead.


This is a false dichotomy. If you log things, you don't need brilliant engineers spending countless hours on a logging infrastructure. All you need to do is log things to local disk, and delete some when you start to run out of space. You can write a fairly trivial script to do so.
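
For what it's worth, the "fairly trivial script" can look roughly like this (a sketch with an invented path and budget): delete the oldest files in the log directory until it fits under a size cap.

    package main

    import (
        "log"
        "os"
        "path/filepath"
        "sort"
    )

    // Illustrative path and budget; adjust for your system.
    const (
        logDir      = "/var/log/myapp"
        maxDirBytes = 10 << 30 // keep at most ~10 GiB of logs
    )

    func main() {
        entries, err := os.ReadDir(logDir)
        if err != nil {
            log.Fatal(err)
        }

        type logFile struct {
            path string
            size int64
            mod  int64
        }
        var files []logFile
        var total int64
        for _, e := range entries {
            info, err := e.Info()
            if err != nil || e.IsDir() {
                continue
            }
            files = append(files, logFile{filepath.Join(logDir, e.Name()), info.Size(), info.ModTime().Unix()})
            total += info.Size()
        }

        // Delete the oldest files first until the directory fits the budget again.
        sort.Slice(files, func(i, j int) bool { return files[i].mod < files[j].mod })
        for _, f := range files {
            if total <= maxDirBytes {
                break
            }
            if err := os.Remove(f.path); err == nil {
                total -= f.size
            }
        }
    }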


That limits your logging rate to your available disk bandwidth.


That's not a problem for 99.99% of applications.


You could set up something to allow a dynamic change in logging level. It doesn't help with blip-type issues that happened in the past, but it helps with issues that arise and continue.

There's a de facto standard for many daemons of using SIGUSR1 as a prompt to close/reopen log files. You could settle on another otherwise-unused signal to change the log level.
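
A hedged Go sketch of that second idea, using SIGUSR2 (picked arbitrarily as an otherwise-unused signal on Unix) to toggle debug logging on a running process:

    package main

    import (
        "log"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
    )

    // debugEnabled is flipped by a signal handler, so the log level of a running
    // process can be raised without restarting it. Unix-only (SIGUSR2).
    var debugEnabled atomic.Value

    func init() {
        debugEnabled.Store(false)
        ch := make(chan os.Signal, 1)
        signal.Notify(ch, syscall.SIGUSR2)
        go func() {
            for range ch {
                next := !debugEnabled.Load().(bool)
                debugEnabled.Store(next)
                log.Printf("debug logging now %v", next)
            }
        }()
    }

    func debugf(format string, args ...interface{}) {
        if debugEnabled.Load().(bool) {
            log.Printf("DEBUG "+format, args...)
        }
    }

    func main() {
        debugf("not printed until someone sends SIGUSR2 to pid %d", os.Getpid())
        select {} // keep the process alive so it can be signalled
    }

With something like this, `kill -USR2 <pid>` flips debug logging on and off without a restart.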


I quite agree with this. In extreme circumstances, you don't know what you need until some disaster happens.

Consider why there are cockpit voice recorders and flight data recorders. Almost all of that data is thrown away.

In a more extreme example than the newspaper, consider some sort of data breach investigation.


My primary disagreement with that thesis would be the utility of logging things that clearly represent transactions, especially to the outside world.

eg, HTTP access logging is good and useful (vs non-structured debug timings on every fragment rendering)

eg, A mail server reporting delivery of a particular message into a mailbox.

I think your "events with identity" description is a nice succinct version of this same idea.


Sounds like it was written by someone who lacks good infrastructure to collect and search logs.

E.g. it is very useful to collect all user actions and other major events. Then, when a user runs into a problem and sends support a message, you know what was going on. You can tell the user the issue has been fixed instead of asking them for reproduction steps.

Aggregate numbers tell you little about what happened to a particular user, and at scale there are bugs that only some people will hit.

You can process logs and produce metrics from them, but you can't do the opposite. Saying logs are just human-readable actionable errors seems backwards to me.


> In my opinion, my thesis from GopherCon 2014 still holds: services should only log actionable information. That includes serious, panic-level errors that need to be consumed by humans, or structured data that needs to be consumed by machines.

But what if I need some information to debug a problem in the future? I log some related information, and when an error happens, I can look at the logs and get that information. This might help to debug the problem. Ideally I would like to have the values of all variables at any moment in time. Practically I can't have that, but I can log the values which, I think, are most important, so later, when I debug a problem, I'll have more information. And if I never need it, no problem: disks are cheap and logs are small (if done properly; there's nothing worse than searching for a needle in a 100GB log file).

For my programs I developed a simple set of rules. Log everything important with DEBUG level (but not too much; don't log 10MB of data for each HTTP request). Log everything that happens in the system at a "higher" level with INFO level (like "add record with text {}"). Log everything unusual with WARN level (like a failed user validation). Log every unexpected but recoverable error (the application can still serve other requests) with ERROR level. Log unrecoverable errors with FATAL level and shut the application down. Works fine for me. TRACE is just for debugging; it's usually turned off, so it can produce any amount of data.
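
If it helps, here is roughly what those rules look like in code, sketched with Go's log/slog (which has no TRACE or FATAL levels, so FATAL is approximated by logging and exiting):

    package main

    import (
        "errors"
        "log"
        "log/slog"
    )

    func main() {
        // DEBUG: important internals, but not huge payloads.
        // (The default slog handler suppresses Debug unless the level is lowered.)
        slog.Debug("parsed request", "fields", 12)

        // INFO: what the system is doing at a "higher" level.
        slog.Info("add record", "text", "hello")

        // WARN: unusual but expected, e.g. a failed user validation.
        slog.Warn("validation failed", "field", "email")

        // ERROR: unexpected but recoverable; other requests can still be served.
        slog.Error("could not persist record", "err", errors.New("db timeout"))

        // FATAL: unrecoverable; log and shut the application down.
        log.Fatal("lost connection to primary store, exiting")
    }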


I too thought that his logging recommendations were weak. Simply logging an error (even with context) doesn't tell you much about how you entered that state.

I follow these guidelines (similar to yours):

  - ERROR: Failed assertions, unrecoverable conditions
  - WARN: Exceptions (request is toast, but system works)
  - INFO: Important actions (entered a certain state)
  - DEBUG: Information associated with the actions above
Note that we too don't log full requests, etc. at DEBUG - that's way, way too much. If you really want to do that, use TRACE, but that's usually only useful to the developers of those libraries. We're also lucky in that we use logback, so we can change the log level of the system at runtime very easily.

In production we usually run on INFO and above and the logs aren't overwhelming - though we're constantly trying to improve the quality and consistency of the messages.


I have found tracing to be very useful when troubleshooting problems. Tracing (or session-based logging) lets you activate detailed logging for specific customers. The system I worked on processed SMS messages for mobiles. There you could enable tracing on a specific mobile phone number, and whenever that subscriber sent an SMS, you got more detailed logs for everything affecting that SMS. Really a great tool. I've described it in more detail here: http://henrikwarne.com/2014/01/21/session-based-logging/
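
A minimal sketch of the idea (not the actual system described): keep a set of traced subscriber numbers and log verbosely only for messages that touch them.

    package main

    import (
        "log"
        "sync"
    )

    // traced holds the subscriber numbers for which detailed, per-message logging
    // is enabled; traffic for everyone else logs at the normal level.
    var traced = struct {
        sync.RWMutex
        set map[string]bool
    }{set: map[string]bool{}}

    func enableTrace(msisdn string) {
        traced.Lock()
        traced.set[msisdn] = true
        traced.Unlock()
    }

    func traceOn(msisdn string) bool {
        traced.RLock()
        defer traced.RUnlock()
        return traced.set[msisdn]
    }

    func handleSMS(from, to, body string) {
        if traceOn(from) {
            log.Printf("TRACE from=%s to=%s len=%d body=%q", from, to, len(body), body)
        }
        // ... normal processing, with normal-level logging ...
    }

    func main() {
        enableTrace("+15551234567")
        handleSMS("+15551234567", "+15557654321", "hello")
    }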


Your point is especially true when a bug is difficult to reproduce. In those cases, good logging can mean the difference between a few hours and a few months to solve a problem.


1. Logs should contain as much information as possible in order to debug. If you have to ask a user to reproduce a bug so that you can insert extra logging statements, then I think you're doing it wrong. You want to be able to go straight to your logs and have enough information to find the problem immediately, without further delay. For example, how are you going to debug a syncing error in a file sync app if you're not keeping detailed logs of scans, inotify events, incoming and outgoing transactions, etc.?

2. You almost want your log entries to tell a story; you want to spend nearly as much time on your log entries as on your code. I think the more you get this right, the better your software design will be.

3. You also want to be able to grep your logs easily using a transaction and user ID. For example, if you just fetched a batch of emails over IMAP, you want to have a transaction ID to describe the session, and include this in all related log entries for the session. You also want to include an account ID in all related entries so you can see all fetches for a given user.

4. Your log entries should do as much work for you as possible. For example, if your log entry logs a filename or string that contains unusual Unicode characters, then you might want your log function to also log what Unicode normalization form the string contains (NFC, NFD, Mixed). This can make otherwise complex bugs trivial to solve.

5. Logs should also be distributed, in the sense that you don't ship all your logs to a central location, but rather have each server manage its own logs. If you need to query your logs, then send a query out to the relevant servers and get them to do the work for you. If logs are always a fraction of a server's data output then you can easily manage gigabytes of log data per server this way.


> Logs should contain as much information as possible in order to debug.

There's a huge operational and engineering cost buried in that statement, which the article is trying to unearth and provide better solutions for.

> 5. Logs should also be distributed, in the sense that you don't ship all your logs to a central location, but rather have each server manage its own logs.

Though it does give you some nice properties, this is counter to a lot of accepted best practice... is there a good distributed log query system? `dsh` doesn't cut it, for the record :)


> There's a huge operational and engineering cost buried in that statement, which the article is trying to unearth and provide better solutions for.

Sure, there's always an operational and engineering cost to logging. I think the original comment makes that plain.


> Logs should also be distributed...

Perhaps there was an unstated assumption on your part, but how are my app servers to know which one handles which query? Is there some sort of log coherence system going on? Are you running Hadoop on your app servers?


Well, it depends on whether you're designing a distributed system or not, what kind of sharding scheme you use if any etc.


> distributed

You keep on using that word... but seriously, thanks for the non-answer, now I can be confident you don't know what you're talking about.


"now I can be confident you don't know what you're talking about."

I guess your original comment made clear that you weren't really interested in an answer beyond dealing with "unstated assumption" and that you want to persist in thinking of logging from a centralized point of view. You just wanted to be "confident".

But if you're seriously interested in discussing further to see what I'm working on, feel free to drop me an email on [email protected] and we can set up a Skype face-to-face.


I know something of logs.

Logging is figuring out how to keep anything and everything from a small to mid-sized retention window, which you hope (but don't really know) may be useful later when you need to zoom in on a very small time window.

Monitoring is having something watch instrumentation (metrics) for anomalies. This may include views you build up over time on the previously mentioned logs.

If there was ever an industry that could benefit from AI, it's logging.


My two cents: if you need to instrument native Windows applications, including Win32 and COM, my company develops three mature GPLv3 libs:

1/ Deviare In-Proc (like Microsoft Detours): https://github.com/nektra/Deviare-InProc

2/ Deviare Hooking Engine (even novices can instrument apps): https://github.com/nektra/Deviare2

3/ RemoteBridge (access internal COM and Java apps) https://github.com/nektra/RemoteBridge

All these libs include examples and there are additional use cases in our blog http://blog.nektra.com


I think the author raises some valuable points but does not emphasize the key point about logging: that _logging has relatively limited applicability_. If you think of logging that way, you log everything, but keep only a time-bound window of log data. Around two weeks is plenty, because most of the time, when you need to access the detailed log information, it's the same or the next day. I think careful attention should be paid to annotating log messages in various parts of the stack with some piece of data that can be used to correlate the messages that are important to your use case, e.g. make sure a customerId or invoiceId or something is logged through the various levels of the stack.
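
For instance, one cheap way to get that correlation (a hedged sketch; the field names are invented) is to derive a logger that stamps every entry with the customer ID and pass it down the stack, so a single grep pulls the whole story:

    package main

    import (
        "log"
        "os"
    )

    // withCustomer returns a logger that prefixes every entry with the customer
    // ID, so one grep pulls that customer's whole story across the stack.
    func withCustomer(customerID string) *log.Logger {
        return log.New(os.Stderr, "customer_id="+customerID+" ", log.LstdFlags)
    }

    func createInvoice(l *log.Logger, invoiceID string) {
        l.Printf("invoice_id=%s event=created", invoiceID)
        chargeCard(l, invoiceID)
    }

    func chargeCard(l *log.Logger, invoiceID string) {
        l.Printf("invoice_id=%s event=card_charged", invoiceID)
    }

    func main() {
        createInvoice(withCustomer("cust-1234"), "inv-5678")
    }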


Interesting!

I must confess, I'm currently trying to get hard data on how our users are using our product, and the thought of logging every request had crossed my mind, both as an entry point to finding the right solution and because our product is hosted on the client's server and rarely under heavy load (we never have more than a dozen users connected at the same time).

I'm now going to look into instrumentation...


Is anyone else here using Prometheus? I'd be interested in hearing about experiences with it.


I am also interested since I saw the lightning talk about it at FOSDEM (video: http://mirror.onet.pl/pub/mirrors/video.fosdem.org/2016/h221...).

For what it's worth, FOSDEM itself uses it for the real-time monitoring of its infrastructure.


The author talks about Graphite and Prometheus as instrumentation systems, but they're for collecting data. What do people recommend as a linkable library for recording metrics in the service process (before they're collected)? A quick google gave me 'go-metrics' and 'dropwizard metrics', but the code I'm working on will require linking with a native C library.


The author is a Go developer; if you are too, a good starting point is to use expvar and then whatever tool your collector provides for collecting expvars. Besides being in the standard library, there's the advantage of there being lots of tools that speak expvar.
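
For example, a minimal expvar setup looks like this; importing expvar registers a handler that exposes the counters as JSON at /debug/vars on the default mux:

    package main

    import (
        "expvar"
        "log"
        "net/http"
    )

    // Counters registered with expvar appear automatically at /debug/vars,
    // alongside runtime stats, for any collector that speaks expvar.
    var (
        requestsTotal = expvar.NewInt("requests_total")
        errorsTotal   = expvar.NewInt("errors_total")
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        requestsTotal.Add(1)
        if _, err := w.Write([]byte("ok\n")); err != nil {
            errorsTotal.Add(1)
        }
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }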


Thanks for that. 'expvar' looks like a great hidden gem for go development. Unfortunately most of our services aren't written in Go so we might have to write our own if there isn't a handy library we could write bindings for.



