Fair point, but this blog and your comment are about a summary sentence. If you read through to how the underlying log reports are constructed, those are very accurate (and also quite concise).
Hi paridso, we tackle the problem at a more foundational level.
AIOps tools are designed to speed up resolution and reduce noise, but they act on a feed of alerts/incidents. So in a sense they are dependent on the quality of the alerts generated by monitoring tools. And typically a human still ends up drilling down into data like logs and metrics to determine root cause.
Our ML acts on the raw data, and has better coverage than typical log/metrics monitoring tools (including previously unknown failure modes). It also cuts time to root cause by generating complete incident summaries.
Hi gingerlime, thank you. One thing we're working on is taking an incident signal from other tools and augmenting it with an incident report. We're starting with PagerDuty and Slack integrations as sources, but could see extending that to other APM/monitoring tools (Datadog could fit here). Of course we do have some overlap in the latter case (for logs & metrics). If you have something more specific in mind, let's connect.
There's more detail on the above-mentioned plan here: https://www.zebrium.com/blog/youve-nailed-incident-detection...
Thanks for sharing the details. Looks interesting. I guess if we get Datadog alerts into Slack and Zebrium listening on the same channel, then we can achieve something similar to what you described for PagerDuty. Right? Definitely sounds interesting. I hadn't thought about that aspect.
My question was more on the integration side. We already send logs and metrics to Datadog, so if we want to add Zebrium into the equation we'd also need to send those there. I was wondering if some kind of integration would allow Zebrium to consume logs/metrics from Datadog, or at least make the integration easier. Just a thought.
In any case, I'm definitely curious to take Zebrium for a spin :)
Right - we will consume alerts from other tools. I see your point about consolidating collection. We'll look for opportunities like this where we can. For now, we do try to make it easy and lightweight to set up our collectors, and many of our users do have multiple collectors/agents on the same clusters. Please contact us if you'd like to give it a try.
:)
founder here:
Well, if there is a runbook for a known failure, we can trigger it via webhook. But auto-remediation for a previously unknown failure is of course a very different beast. An ambition for the future...
founder here:
You certainly can. Our goal is to minimize this need for you, but any team with experience already has some problem signatures/alerts for known issues, and we've tried to make it easy to capture those.
Our ML helps with even this chore in one way: if you're building a signature that relies on a log event, normally that's done with regexes, which leaves you at the mercy of a developer not changing the wording. Our ML tracks the underlying event type and ensures those signatures don't break if the log format changes in a future rev (a small illustration of that fragility is below).
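To make the fragility concrete, here's a minimal sketch. The log lines, the reworded message, and the regex are all made up for illustration; this isn't Zebrium code, just the generic regex-signature pattern described above:

    import re

    # Hypothetical regex-based signature for a known failure,
    # written against today's log wording (illustrative only).
    signature = re.compile(r"connection to db-(\w+) timed out after (\d+)ms")

    old_line = "connection to db-primary timed out after 3000ms"
    # The same event after a developer rewords it in a later release:
    new_line = "db-primary: connection timeout (3000 ms)"

    print(bool(signature.search(old_line)))  # True  -> the alert fires
    print(bool(signature.search(new_line)))  # False -> the signature silently breaks

Tracking the event type rather than the literal string is what keeps a signature like this working across releases.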
csears, founder here - could not agree more with your comments.
1. We learned early to make user feedback easy (and immediately actionable): users can quickly "like", "mute" or "spam" an incident, or go more granular if needed.
2. This is an insightful comment. A few of our early users gave us similar feedback, and we've been hard at work. We'll soon be releasing a mode that takes an incident signal from your incident management tool, such as PagerDuty or even Slack (often people create a Slack channel per incident), and constructs a report around it.
3 & 4 are good points as well. We don't disagree about enrichment; we just need to stage things.
Hi, one of the founders here: The service is designed not to require specific data models, because that approach doesn't scale, nor does it keep up with changes in application behavior.
Instead, the ML engine learns the data structures, the normal behavior of logs and metrics, and the normal correlations between them for each app deployment on the fly. Then, when things break, it does a very good job of generating incidents. We make user feedback easy, so if we are "over-eager" in detecting a certain kind of incident, your response trains the ML quickly.
We do improve the ML engine with experience, of course (and have added some user controls), but we now have dozens of applications using us, and cumulatively over a thousand successfully detected incidents under our belt.