Thanks for taking the time to reply! I'm relatively new to working on this type of system (large scale, event driven) and half posted because I know there are people on HN way better than me at this, and was curious about their opinions.
In the end, what's the difference between a log and a metric? Is one structured, and one unstructured? Is one a giant blob of text, and the other stored in a time series db? At the moment I guess I'm "logging my metrics" with structured logs going into Loki which can then unwrap and plot things.
You and the other commenters have given me the vocabulary to dig more into this area on the internet though. Thanks!
As someone who has worked in and around logging and big data processing for 16 years now, including almost a decade as a senior in professional services (currently a global security architect) at one of the largest big data companies, here is my opinion on logs vs metrics:
A log entry should capture an event in time, for example: a person logging in, a failure, a record of a notable event occurring, etc. These should be written at the time they occur when possible, to minimise chance of loss and to minimise delay for any downstream systems that might consume the logs. Arguments for batching could easily be made for systems generating very high volumes of logs.
Conversely, a metric is a single value, point-in-time capture of the size of something, measured in units or with a dimension. For example: current queue depth, number of records processed per second, data transfer rate in MB/s, cpu consumption percentage, etc. These can/should be written periodically, as mentioned in TFA.
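To make the distinction concrete, here's a minimal sketch; the queue and the emit_metric helper are stand-ins for whatever your real system and metrics client look like, not any particular library:

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("worker")

    queue = []  # hypothetical in-process work queue

    def emit_metric(name, value):
        # Stand-in for a real metrics client (statsd, Prometheus, etc.).
        print(f"{name} {value} {int(time.time())}")

    def handle_login_failure(user_id):
        # A log entry: an event in time, written at the moment it occurs.
        log.warning("login failed for user_id=%s", user_id)

    def sample_metrics_forever():
        # A metric: a point-in-time measurement of a size, emitted periodically.
        while True:
            emit_metric("queue_depth", len(queue))
            time.sleep(10)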
> In the end, what's the difference between a log and a metric?
Essentially, a log entry is the emission of state known to an individual code execution path at the point the log entry can be produced, whereas a metric is a measurement of a specific aspect of the system's runtime execution.
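For instance, take a hypothetical statement along these lines (the names and values are made up for illustration):

    import logging

    log = logging.getLogger("pipeline")
    record_count, elapsed_ms = 1000, 123  # hypothetical values computed earlier
    log.info("processed %d records in %d ms", record_count, elapsed_ms)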
This emits a log entry capturing the processing state known when the statement is evaluated. What it does not do is separate this information (a time-based attribute in this case) from other log entries, such as "malformed event detected" or "database connection failed."
More importantly, putting metrics into log entries forces timing to include log I/O, requires metrics analysis systems to parse all log entries, and limits the type of metrics which can be reported to be those expressible in a message text field.
Maybe most important of all, however, is that metrics collection and reporting is orthogonal to logging. So in the example above, if the log level were set to "error", then there would be no log-based metric emitted.
This is a reasonable first pass answer, but there's more nuance to this...
> What it does not do is separate this information
Logging at scale should really be structured, which means that you can trivially differentiate between different types of log message. You also get more dimensions all represented in that structure.
> limits the type of metrics which can be reported to be those expressible in a message text field
This is another example: ideally, logging shouldn't be text-based. You might have a human-readable summary field, but metrics can easily be attributes on the log message.
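For example, a structured entry with a metric carried as an attribute might look like this (a sketch; the field names are made up):

    import json
    import sys
    import time

    def log_event(message, **attrs):
        # Structured logging: a human-readable summary plus typed attributes.
        entry = {"ts": time.time(), "message": message, **attrs}
        sys.stdout.write(json.dumps(entry) + "\n")

    log_event(
        "batch processed",
        event_type="batch_processed",  # trivially distinguishes message types
        records=1000,
        duration_ms=123,               # a metric, carried as an attribute
    )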
The more I work in this area the more I'm realising that logs and metrics are pretty interchangeable. There are trade-offs for each absolutely, but you can convert logs into metrics easily (Datadog does this), and with a bit more effort you could turn a metric into logs if you wanted to (querying metrics as rows in a SQL database is handy!).
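As a sketch of the logs-to-metrics direction, assuming structured entries like the one above (the 60-second window is arbitrary):

    from collections import defaultdict

    def logs_to_metrics(entries, window_s=60):
        # Roll structured log entries up into per-window counts and averages.
        buckets = defaultdict(list)
        for e in entries:
            buckets[int(e["ts"] // window_s)].append(e["duration_ms"])
        return {
            w: {"count": len(ds), "avg_duration_ms": sum(ds) / len(ds)}
            for w, ds in buckets.items()
        }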
Metrics collection is also not necessarily orthogonal to logging, it depends on your system. From a server, you might have logs pushed to an external source and metrics pulled from the server by Prometheus, but that's just implementation details. You can also have logs pulled from log files, and metrics pushed to a statsd endpoint.
I've worked on mobile apps where metrics get aggregated locally and then pushed as log events to the server with one log event per metric and dimension set, only for the server to then typically turn them back into metrics.
It's good to understand the tradeoffs, the technology, whether you're using push or pull, where data is spooled or aggregated, data costs, etc. But this stuff is all pretty malleable and there's often no clearly right answer.
I think what you're saying is that you can make a logging system LARP as metrics. In the end, logging goes to fds 1 and 2 while metrics usually go over HTTP; of course you can dump "metrics" into stdout, but it's not as practical given what the tools are built for.
In my local fun projects that run on my machine I might dump metrics into the logs because it's practical, but it doesn't make it "right".
I log over RPC and send metrics over RPC, thinking about logs being to a file descriptor and metrics being over HTTP is focusing too much on a particular implementation and not enough on the concepts.
Also it's not about logs role playing as metrics, I'm saying you can literally turn one into the other, in both cases, and there are valid use cases for that.
I know and understand the reasons for that rule, but it’s one of the first ones I disable in linters. The theoretical benefits in the context of the systems I work on aren’t worth the extra friction.
> I'm relatively new to working on this type of system.
> In the end, what's the difference between a log and a metric?
I don't mean to put you down, but writing a logging advisory blog post when you don't know the difference between a log and a metric seems like a peculiar thing to do.
Not that I'm shaming a lack of knowledge; we all had to learn somehow.
The question is whether you want to do your aggregation by unit time at the application level, or at an observability layer. You're absolutely right that the end user of metrics wants to see things grouped by time - but what if they want to filter down to "events where attribute X had value Y, in 10 second increments" but you had decided to group your metrics by 15 second increments without regard to attribute X?
Various companies, both in-house for big tech and then making this more widely accessible, started to answer this question by saying "pump all your individual logs in structured form into a giant columnar database that can handle nearly arbitrary numbers of columns, and we'll handle letting you slice and dice metrics out of any combination of columns you want. And if you have an ID follow the session around between different microservices, and maybe even all the way to the browser session, you can track the entire distributed system."
Different people might say that Datadog, Honeycomb, or Clickhouse (and the various startups backed by Clickhouse as a database) were the ones to make this pattern mainstream, and all of them pushed the boundaries in one way or another - nowadays, there's a whole https://opentelemetry.io/ standard, and if you emit according to that, you can plug in various sinks made by various startups, and choose the metrics UX that makes the most sense for your use case.
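In Python, emitting such a wide event looks roughly like this (a sketch against the opentelemetry-api package; the span and attribute names are mine, and a configured SDK/exporter is assumed to actually ship the data):

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")

    def checkout(user_id, cart):
        with tracer.start_as_current_span("checkout") as span:
            # Attach arbitrary dimensions; the backend can slice on any of them.
            span.set_attribute("user.id", user_id)
            span.set_attribute("cart.items", len(cart))
            # ... do the work; latency and trace context are captured with the span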
I'm a huge fan of Honeycomb - when I know a certain issue is happening, I can immediately see a chart showing latencies and frequencies, and click any hot spot to filter out the individual traces that exhibit the behavior and trace the end-to-end user journey, with all the different logs from all the systems touched by that request. And I can even begin this discovery from a single bug report by a single user whose ID I know. It's not just metrics - it's operational support. And if I'd pre-aggregated logs, I'd have none of this.
But of course, there are systems where this doesn't make sense! Large batch jobs, high-performance systems with orders of magnitudes more events than a standard web application... it's not one size fits all. That said, I think knowing about modern observability should be part of every developer's toolkit.
I love how open and non-defensive this comment is :)
There are a few ways to slice this, but one is that logs are human-readable print statements and are often per-task. E.g. if you have 100 machines, you don't want to co-mingle their logs because that will make it harder to debug a failure. Metrics are statistics and are often aggregated across tasks. But there are also per-task metrics like cpu usage, io usage etc.
They can both be structured to some extent. Often storage strategies might differ but not necessarily. I think at Google the evolution of structured logging was probably something like (1) printf some stuff, (2) build tooling to scrape and combine the logs, (3) we're good at searching, but searching would be easier if we just logged some protos.
I think logs are basically self-explanatory since everything logs. To understand why you would want separate metrics, consider computing the average CPU utilization for your app across a fleet of machines. You don't want to do that by printf-ing the CPU usage, grep-ing all the logs, etc. You could try to do that with structured logs, and I'm sure some structured-logging SaaS companies would advocate that.
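For the CPU example, the metrics-native approach is something like the following (a sketch using prometheus_client and psutil; the aggregation lives in Prometheus, not your app):

    import time

    import psutil
    from prometheus_client import Gauge, start_http_server

    cpu_gauge = Gauge("app_cpu_percent", "Host CPU utilisation percent")

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes this endpoint on each machine
        while True:
            cpu_gauge.set(psutil.cpu_percent(interval=None))
            time.sleep(15)

The fleet-wide average then becomes a one-line server-side query (avg(app_cpu_percent) in PromQL) rather than a grep job.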
If you're new to this space, I really liked the book Designing Data-Intensive Applications.
A quick summary that does it for me: a log is something you read, a metric is something you measure.
Use cases:
Log: search, get context, read
Metric: measure, plot dashboards, define alerts
My theory for why the concepts are so mixed up together: you use both to troubleshoot, and I think the old school way to emit metrics was to parse logs and turn them into measurements.
You might want to check out this very nice article on reservoir sampling, which discusses its application to logging: https://samwho.dev/reservoir-sampling/
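The core of it fits in a few lines; here's a sketch of Algorithm R (not taken from that article's code):

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k items from a stream of unknown
        # length, e.g. log lines arriving faster than you want to retain them.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = random.randint(0, i)  # inclusive of both ends
                if j < k:
                    reservoir[j] = item
        return reservoir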
I'm not sure I want to weigh in on "log" vs "metric"... but I did want to add some thoughts on logs in general.
If you need to "log" something to give users feedback as the system is running, it may be less of a log and more of a progress or status output.
Logs to me are things which happen and I want to be able to trace later, so summarizing or otherwise dropping logs that come in quickly in succession would be a problem. If I need to filter I pipe to grep, otherwise I can just save it all and read through it later.
Status messaging, which may be informative about your process, is useful, and if its goal is to be observed in real time, then yes, a message or two per second seems like a good target for consistency.
These are just two very different use cases to me. And generally I find the former critical to get right, while the latter may be nice to have and may lead to discovery by the nature of making it more accessible.
Metrics model some measurable, quantifiable state.
In high volume systems both can then be observed through various sampling techniques. A key point is that sampling is best handled separately from the application logic creating those signals, as it may change over time or be dynamic.
> The moment of capturing a measurement is known as a metric event
Which suspiciously reads like a log.
In practice, a metric is an aggregate of events (the "metric events") for when you're not interested in the individual event but in the aggregate itself. For practical reasons this is implemented not with logs but with more primitive event-emission machinery.
These are not fundamentally incompatible notions. If you take an electrocardiogram, you might be interested in your BPM, but it is deduced from the full log of each beat. The segregation we do in computing is more practical than fundamental.
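A quick sketch of that ECG framing, with hypothetical numbers:

    beats = [0.00, 0.82, 1.65, 2.47, 3.30]  # the log: a timestamp per beat, in seconds
    intervals = [b - a for a, b in zip(beats, beats[1:])]
    bpm = 60 / (sum(intervals) / len(intervals))  # the metric, deduced from the log
    print(round(bpm))  # ~73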
Completely agree on the confusing terminology there. IMO that should be:
> The moment of capturing a measurement is known as a metric sample.
The mental model I hold is that the metric is the actual value. This may be discrete (e.g. a packet counter) or some continuous value (e.g. a voltage in your ECG example). It can then be observed at some time/value-delta interval or summarised into other time series based on what you're hoping to capture.
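For the discrete case, observing the value at intervals and deriving a rate series might look like this (a hypothetical helper; assumes a monotonically increasing counter):

    def to_rate(samples):
        # samples: list of (timestamp_s, counter_value) pairs, e.g. packets seen.
        return [
            (t1, (c1 - c0) / (t1 - t0))
            for (t0, c0), (t1, c1) in zip(samples, samples[1:])
        ]

    print(to_rate([(0, 0), (10, 1500), (20, 2400)]))  # [(10, 150.0), (20, 90.0)]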
Read the SRE book, maybe just the highlight chapters. It will give you the right jargon and a lot of wisdom that you can then simplify for your use cases.
This was shared with me years ago by another developer I worked with. I still reference it today as I continue my eternal battle with the complexity demon.
What a fantastic write-up. As a Brisbane native and software developer I often feel similarly to the author about Brisbane's software dev scene. Brisbane so often feels like a backwater, with the big dogs down in Melbourne and Sydney, and the 'peak of industry' in the US.
I'd love to move to Seattle and work for Amazon or something to get 'relevant industry experience' but what I'd really love to do is make a go of it here because - like the author - I believe Brisbane is secretly still the best city in the world ;-)
I lived and worked as a dev in Seattle for 8 years before moving to Sydney. I want nothing more than for Australia to have a thriving tech scene but I haven’t seen much progress in that area since I moved here 5 years ago. I still love it and have no plans to go back. I just wish there was more opportunity here and not so much constant pressure to move back to the US for increased salary and challenge.
Funny. I lived in Seattle for 5 years before I moved to Sydney, where I lived for another 5. That was a different era though; tech wasn't the industry it is now and the internet still felt new. I moved down in 2003, and my American accent helped me land a job I wasn't qualified for (I'd taught myself some PHP and Java in Seattle, though I mostly worked as a bartender). In 2005 I started a small software shop with some friends. Back then (2003) the Ruby users' group was too small to get a reservation at a pub, so we'd have to partner up with the Smalltalk guys. Rails came out a year or so later and that changed.
I got back into web stuff when I moved to the states and have been up and down the stack many times since, but I have a ton of nostalgia for the stuff we did back then. Web 2 was an annoying new buzzword and we were still mostly writing software for kiosks, device drivers in C, bridging that with Lua, and using Flash for the interface b/c everybody else in the space was using shitty C++ Motif interfaces. . . . memory lane.
I imagine Newtown and the Inner West are a lot different than when I lived there, but I do miss that time.
I just don’t see _as much_ self-directed ambition or obsession? Going to a meetup in Seattle or SF in the early 2010s there were serious obsessives. Masters of domains like Go or JavaScript and someone from Sequoia at the Startup Weekend. Always flocks of folks looking to start their next business. That same bug just never hit here?
This I find weird, surely there are people who can sense opportunities unlockable by tech and Australia is not at all easier or any less expensive than the U.S., I still can’t quite put my finger on it. For me there’s still a magical cultural element to a place like SF, and to an extent – Seattle, when it comes to creating new opportunities.
Two factors, I think: (1) obsessives are more likely to follow their obsession into immigration, a factor which works to the advantage of certain parts of the US and to the detriment of most of the rest of the world; (2) Australian investors tend to be more risk-averse: they will offer less money and demand a bigger stake for it, and many prefer later-stage startups to the truly early-stage ones.
Lots of factors involved, some more regulatory others more cultural. One is that Australia’s property market has been so hot for so long it sucks up a lot of investment; the US market, while recently being quite hot as well, has historically been much more mixed (the US had a big price drop around the time of the GFC, Australia saw some declines but they were a lot smaller). Another is the US legal system tends to be more borrower-friendly in bankruptcy, foreclosures, etc, making people more willing to take out loans to fund their business ideas
Australia in some ways is the opposite of the US. Too much regulation and not enough effort to help people start businesses. It really needs to change and they’re missing a big opportunity to make the start up scene better. Just as long as we don’t do it while throwing out sensible regulations.
I live down the street from Amazon's relatively nice suburban office (you couldn't pay me to step foot in Seattle).
Let me save you the trip, you don't want to work for Amazon at the money they pay. They would have to 1.5x it or maybe even double it to make it worth the suffering of working there.
Life is short: work somewhere else, or failing that, on your own thing :)
As someone who’s from Brisbane but spent the last 7 years in London you’re 100% correct. Brisbane is the best city in the world. I’m excited to eventually move back.
"I believe Brisbane is secretly still the best city in the world"
Personally, the three times I visited Brisbane were all in all quite neutral for me: not great, not bad. But friends had way worse experiences, and when I found the iconic backpackers' book "No Shitting in the Toilet", I had a good laugh at these passages:
"A friend of mine would never leave a place until he’d had a good time there. Another friend would not leave a destination until he had learnt something encouraging about the people and their culture. Both are currently stuck in Brisbane."
So... I would have been stuck there as well.
So please, no offense meant about your home town.
I love Queensland. And Bluey. And would give your hometown a chance again.
But I do know people who never ever want to go there again. (But it also has been some years.)
Oh I can 100% see where all of that comes from too.
I think a lot of Brisbane's secret beauty is well hidden from people just visiting: the temperate rainforests, the Glass House Mountains, some of the best beaches in the world all within an hour's drive. The strange birds, the general attitude of the public. I think it's all quite nice. My only personal gripe is that it's far too hot in summer!
I'm also extremely biased though, so take my opinion with a grain of salt. Brisbane does have an awful lot of mediocrity too, but I'm still proud of it, and keen to show it off in 2032 with the Olympics!
I'm using Supabase to track progress and prestige, but you're right — localStorage would work better for a pure idle feel. Might switch to a hybrid setup soon: local by default, backend if you want promotions and Discord drama
Surely Airbnb, a company that runs a website, has the capability to put a text post on their own website. Then they'd own the content, and people looking for it could find it more easily. It's not a revolutionary concept either; Facebook has one:
The reason for this blog post is "recruit engineers". Not every engineer is going to visit blog.airbnb.com, but presumably a lot of them are already on Medium.
They even close the article with,
> We continue to solve interesting problems around LTV every day (and as more insights come up, we’ll keep sharing them on our blog). Can you see yourself making an impact here? If so, we encourage you to explore the open roles on our team.
I think it's amazing the Python community has like 5 half-baked solutions to this problem, all of which are either abandoned, poorly monetized, or have a janky UI. I mean, we have Zappa, Chalice, and Serverless, and if you attempt to do it yourself, do you use CloudFormation, CDK, or AWS SAM?
Following with interest; I think there is room for a better tool in this space.
Nice writing and summary there.
As the author of Stelvio, I felt the same. It's not that there are no tools, but none of them covers enough of the AWS surface while being truly focused on Python devs. Some are general infra tools (CDK, Pulumi) aimed at infra people, where Python is not the primary language; some are Python tools like Chalice (likely abandoned); and some are meant for Python devs but use JSON (Zappa).
I'm working hard on Stelvio, hoping it can become the best tool for doing infra in Python.
Btw, if you'd be willing to share more details about what you want or miss in Python cloud tooling, could you drop me an email at michal at stelvio.dev? I'm happy to talk. That applies to other interested people too!
Next.js on Vercel. I don't get enough hits to take me out of the free tier, so I'm fine staying here for now. If it became absurdly expensive, I'd try to either self-host, or move to something else.
All the blog posts are just written in markdown and would be easy to migrate, but some of the fancier stuff would be harder.
Luckily if you just write drivel like me, the free tier lasts a long time
I recently made the backend of a hobby Python project run entirely serverlessly on AWS, and reduced my bill from ~$300 a year to $0 a year.
There seem to be a million and one ways to make serverless Python websites, but none of them is easy. Thought I'd post in case anyone is going through the same process and this might help, or in case there are other ways I should have done this and can be enlightened by the comment section.
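For anyone doing the same by hand: the application side can be as small as a single handler (a minimal sketch; the event fields assume API Gateway's HTTP API / Lambda function URL payload, and the real work is in packaging, routing, and IAM):

    import json

    def handler(event, context):
        # Entry point invoked by AWS Lambda.
        path = event.get("rawPath", "/")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"path": path, "message": "hello from Lambda"}),
        }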
I first attempted to use ruff for a small project ~2 years ago, and at the time felt that it wasn't quite good enough to replace the black+isort+whatever-linter combo we were using at work.
I've used it a few times since then and now I'm a big proponent of using only ruff. I think most of its value comes from:
1. Being fast (or at least fast enough that it's not annoying).
2. Replacing the linting/formatting combo of multiple tools, reducing the cognitive load for the developer.