Fair point, but this blog and your comment are about a summary sentence. If you read through to how the underlying log reports are constructed, those are very accurate (and also quite concise).
Hi paridso, we tackle the problem at a more foundational level.
AIOps tools are designed to speed up resolution and reduce noise, but they act on a feed of alerts/incidents. So in a sense they are dependent on the quality of the alerts generated by monitoring tools. And typically a human still ends up drilling down into data like logs and metrics to determine root cause.
Our ML acts on the raw data, and has better coverage than typical log/metrics monitoring tools (including previously unknown failure modes). It also cuts time to root cause by generating complete incident summaries.
Hi gingerlime, thank you. One thing we're working on is taking an incident signal from other tools and augmenting it with an incident report. We're starting with PagerDuty and Slack integrations as sources, but could see extending that to other APM/monitoring tools (Datadog could fit here). Of course we do have some overlap in the latter case (for logs & metrics). If you have something more specific in mind, let's connect.
There's more detail on the above-mentioned plan here: https://www.zebrium.com/blog/youve-nailed-incident-detection...
Thanks for sharing the details. Looks interesting. I guess if we get Datadog alerts into Slack and Zebrium listening on the same channel, then we can achieve something similar to what you described for PagerDuty. Right? Definitely sounds interesting. I hadn't thought about that aspect.
My question was more on the integration side. We already send logs and metrics to Datadog, so if we want to add Zebrium into the equation we'd also need to send those there. I was wondering if some kind of integration would allow Zebrium to consume logs/metrics from Datadog, or at least make the integration easier. Just a thought.
In any case, I'm definitely curious to take Zebrium for a spin :)
Right - we will consume alerts from other tools. I see your point about consolidating collection. We'll look for opportunities like this where we can. For now, we do try to make it easy and lightweight to set up our collectors, and many of our users do have multiple collectors/agents on the same clusters. Please contact us if you'd like to give it a try.
:)
founder here:
Well, if there is a runbook for a known failure, we can trigger it via webhook. But auto-remediation for a previously unknown failure is of course a very different beast. An ambition for the future...
founder here:
You certainly can. Our goal is to minimize this need for you, but any team with experience already has some problem signatures/alerts for known issues, and we've tried to make it easy to capture those.
Our ML helps with even this chore in one way: if you're building a signature that relies on a log event, normally that's done with regexes, which leaves you at the mercy of a developer not changing the wording. Our ML tracks the underlying event type and ensures those signatures don't break if the log format changes in a future rev (a small illustration of that fragility is below).
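To make the fragility concrete, here's a minimal sketch. The log lines, the reworded message, and the regex are all made up for illustration; this isn't Zebrium code, just the generic regex-signature pattern described above:

    import re

    # Hypothetical regex-based signature for a known failure,
    # written against today's log wording (illustrative only).
    signature = re.compile(r"connection to db-(\w+) timed out after (\d+)ms")

    old_line = "connection to db-primary timed out after 3000ms"
    # The same event after a developer rewords it in a later release:
    new_line = "db-primary: connection timeout (3000 ms)"

    print(bool(signature.search(old_line)))  # True  -> the alert fires
    print(bool(signature.search(new_line)))  # False -> the signature silently breaks

Tracking the event type rather than the literal string is what keeps a signature like this working across releases.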
csears, founder here - could not agree more with your comments.
1. We learned early to make user feedback easy (and immediately actionable): users can quickly "like", "mute" or "spam" an incident, or go more granular if needed.
2. This is an insightful comment. A few of our early users gave us similar feedback, and we've been hard at work. We'll soon be releasing a mode that takes an incident signal from your incident management tool, such as PagerDuty or even Slack (often people create a Slack channel per incident), and constructs a report around it.
3 & 4 are good points as well. We don't disagree about enrichment; we just need to stage things.
Hi, one of the founders here: The service is designed not to require specific data models, because that approach doesn't scale, nor does it keep up with changes in application behavior.
Instead, the ML engine learns the data structures, the normal behavior of logs and metrics, and the normal correlations between them for each app deployment on the fly. Then, when things break, it does a very good job of generating incidents. We make user feedback easy, so if we are "over-eager" in detecting a certain kind of incident, your response trains the ML quickly.
We do improve the ML engine with experience, of course (and have added some user controls), but we now have dozens of applications using us, and cumulatively over a thousand successfully detected incidents under our belt.