yansoki's comments

Love this... simply reusing a quick fix definitely doesn't help identify root causes... LLMs have come a long way, and I feel that with adequate tooling and context (the rich ticket data you mentioned), they could really be a great solution, or at least provide even better context to developers.


Thanks... thinking about using AI to learn what is actually "important" to the developer or team... tracking the alerts that actually lead to manual interventions or meaningful repo changes... that way, we could automatically route alerts to tiers... just thinking.


You could, but I personally wouldn't for a few reasons.

The first is that there are simpler ways that are faster and easier to implement. Just develop a strategy for identifying whether pages are actionable. It depends on your software, but most should support tagging or comments. Make a standard for tagging them as "actioned on" or "not actionable", and write a basic script that iterates over the alerts you've gotten in the past 30 or 90 days and shows the number of times each alert fired and what percentage of the time it was tagged as not actionable. Set up a meeting to run that report once a week or month, and either remove or reconfigure alerts that are frequently tagged as not actionable.
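For illustration, a minimal sketch of that report script, assuming alerts can be exported as dicts with hypothetical "name", "fired_at", and "tag" fields (adjust to whatever your alerting tool actually exposes):

    # Summarize how often each alert fired in the last N days and what share
    # was tagged "not actionable". Field names here are assumptions.
    from collections import Counter
    from datetime import datetime, timedelta, timezone

    def actionability_report(alerts, days=90):
        cutoff = datetime.now(timezone.utc) - timedelta(days=days)
        fired = Counter()
        not_actionable = Counter()
        for alert in alerts:
            if alert["fired_at"] < cutoff:
                continue
            fired[alert["name"]] += 1
            if alert.get("tag") == "not actionable":
                not_actionable[alert["name"]] += 1
        # Noisiest alerts first, so the weekly/monthly review starts with them.
        for name, count in fired.most_common():
            pct = 100.0 * not_actionable[name] / count
            print(f"{name}: fired {count}x, {pct:.0f}% not actionable")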

The second is that I don't think AI is great at that kind of number crunching. I'm sure you could get it to work, but if it's not your primary product then that time is sort of wasted. Paying for the tokens is one thing, but messing with RAG for the 85th time trying to get the AI to do the right thing is basically wasted time.

The last is that I don't like per-alert costs, because it creates an environment ripe for cost-cutting by making alerting worse. If people have in the back of their head that it costs $0.05 every time an alert fires, the mental bar for "worth creating a low-priority alert" goes up. You don't want that friction to setting up alerts. You may not care about the cost now, but I'd put down money that it becomes a thing at some point. Alerting tends to scale superlinearly with the popularity of the product. You add tiers to the architecture and need more alerts for more integration points, and your SLOs tighten so the alerts have to be more finicky, and suddenly you're spending $2,000 a month just on alert routing.


Thank you...reading from you guys has been great so far


So it's really a manual and iterative process... which means there should be room for something to be done.


You learn pretty quick. Take CPU: I don't alert on it; I alert on load average, which is more realistic. I'm also a solo dev, so I use the 15-minute average, and it needs to be above a pretty high threshold three times in a row. I don't monitor RAM usage, but swap instead. When it triggers, it's usually something that needs to be fixed.
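For what it's worth, a rough sketch of that consecutive-breach check in Python; the threshold and interval are made-up numbers, and os.getloadavg() is Linux/macOS only:

    import os
    import time

    THRESHOLD = 8.0       # assumed "pretty high" 15-min load for this box
    CHECK_INTERVAL = 300  # seconds between checks (assumption)
    CONSECUTIVE = 3       # breaches in a row before alerting

    def watch_load():
        breaches = 0
        while True:
            load15 = os.getloadavg()[2]  # index 2 is the 15-minute average
            breaches = breaches + 1 if load15 > THRESHOLD else 0
            if breaches >= CONSECUTIVE:
                print(f"ALERT: 15-min load {load15:.2f} > {THRESHOLD} "
                      f"for {CONSECUTIVE} checks in a row")
                breaches = 0  # reset so it doesn't re-fire every interval
            time.sleep(CHECK_INTERVAL)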

Also check for a monitoring solution with quorum, that way you don’t get bothered by false positives because of a peering issue between your monitoring location and your app (which you have no control over).


I wasn't building one exactly for myself, but I believe not all devs have a team available to monitor deployments for them... and sometimes centralized observability could really be a plus and ease the life of a developer... just being able to visualize the state of your multiple VPS deployments from a single PC, without logging into your provider accounts, should count for something, I believe... and that's without any form of anomaly detection or extra advice about your deployment state... I want to believe this is useful, but again, the critique is welcome.


Agreed... and monitoring isn't just alerts; a dashboard and reports can provide insights and early warnings before things start falling over.

A single 80% CPU spike isn't anything to worry about by itself... but if it is prolonged, frequent, and accompanied by a significant impact on p95/p99 latency and response times, it could be a critical warning that you need to either mitigate an issue or upgrade soon.

I would be more inclined to set limits on response latency, or other metrics that directly impact users, based on what is tolerable, and use those as critical alert levels. For the rest, you can use reports over, say, hourly or half-hourly windows to see where the performance hits are, what the top latency values were in addition to p95/p99, etc.
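A rough sketch of that split in Python, assuming per-request latencies (in ms) are collected into fixed windows; the 800 ms "tolerable" p99 limit is an assumption, not a recommendation:

    import statistics

    TOLERABLE_P99_MS = 800  # assumed user-facing limit; set from your own SLO

    def summarize_window(latencies_ms):
        # statistics.quantiles with n=100 returns 99 cut points;
        # index 94 is ~p95 and index 98 is ~p99.
        cuts = statistics.quantiles(latencies_ms, n=100)
        p95, p99 = cuts[94], cuts[98]
        if p99 > TOLERABLE_P99_MS:
            # Only the user-impacting breach pages anyone.
            print(f"CRITICAL: p99 {p99:.0f} ms exceeds {TOLERABLE_P99_MS} ms")
        # Everything else feeds the hourly/half-hourly report, not a page.
        print(f"window report: p95={p95:.0f} ms, p99={p99:.0f} ms, "
              f"max={max(latencies_ms)} ms")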


Thanks tracker...very insightful


Yes, that's true... but my frustrations made me wonder if others really face these problems, and before attempting to solve it, I want to know about the solutions already available... but lol, everyone seems to say it's hell.


As YAGNI says: you ain't gonna need it. Until you do, and only then do you act and fix that particular problem. It's that simple. So unless there is an actual problem, don't worry about it.


So ideally, a system that can learn from your infrastructure and traffic patterns or metrics over time? Cuz that's what I'm thinking about and your last statement seems to validate it...also from what I'm getting no tool actually exists for this


I would not want to use that for alerts (automatically) but I'd consider it for suggesting new alerts to set up or potential problems. If it was at all accurate and useful.


Okay, thanks a lot, didn't see it like this


You are the tool. The human element.


This is an incredibly insightful and helpful comment, thank you. You explain exactly what I thought when writing this post. The phrase that stands out to me is "constant iterative process." It feels like most tools are built to just "fire alerts," but not to facilitate that crucial, human-in-the-loop review and tweaking process you described. A quick follow-up question if you don't mind: do you feel like that "iterative process" of reviewing and tweaking alerts is well-supported by your current tools, or is it a manual, high-effort process that relies entirely on team discipline? (This is the exact problem space I'm exploring. If you're ever open to a brief chat, my DMs are open. No pressure at all, your comment has already been immensely helpful, thanks.)


You've hit the nail on the head with "devilishly hard." That phrase perfectly captures what I've felt. What have you found to be the most "devilish" part of it? Is it defining what "normal" behavior is for a service, or is it trying to account for all the possible-but-rare failure modes?


Thanks for the sanity check. In your experience, what's the biggest source of the noise? Do you find it's more of a tooling problem (e.g., bad defaults) or a people/process problem (e.g., alerts not being cleaned up)?

