Hacker News
Ask HN: Data quality monitoring?
2 points by canterburry on June 29, 2015 | 4 comments
We have a data quality problem.

We run a large analytics and real-time decisioning pipeline operating on thousands of different parameters per event. We source this data from many different sources, both inside and outside our company. Which parameters are required and which are optional varies by event type. The mix of parameters for each event changes over time, but not frequently.

Is there anything out there that can monitor the expected set of parameters for each event, including the expected range for each parameter, and alert when a parameter is either missing or out of bounds?

I do not want to define this manually, but would prefer something that uses machine learning and statistical inference.
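To make the ask concrete, here's a toy version of the kind of thing I'm imagining (all names made up, not a real product): learn, per event type, which parameters are almost always present and what numeric range each has been seen in, then flag deviations.

```python
from collections import defaultdict

class EventProfile:
    """Toy sketch: learn per-event-type parameter presence and numeric
    ranges from history, then flag missing or out-of-bounds parameters.
    Class and method names are hypothetical."""

    def __init__(self, presence_threshold=0.95):
        self.presence_threshold = presence_threshold
        self.counts = defaultdict(int)        # events seen per event type
        self.param_counts = defaultdict(int)  # (type, param) -> occurrences
        self.ranges = {}                      # (type, param) -> (min, max)

    def observe(self, event_type, params):
        """Feed one historical event into the profile."""
        self.counts[event_type] += 1
        for name, value in params.items():
            key = (event_type, name)
            self.param_counts[key] += 1
            if isinstance(value, (int, float)):
                lo, hi = self.ranges.get(key, (value, value))
                self.ranges[key] = (min(lo, value), max(hi, value))

    def check(self, event_type, params):
        """Return human-readable anomaly strings for a new event."""
        alerts = []
        total = self.counts.get(event_type, 0)
        if total == 0:
            return alerts  # never seen this event type; nothing learned yet
        for (etype, name), seen in self.param_counts.items():
            if etype != event_type:
                continue
            # A parameter present in >= threshold of past events is "required".
            if seen / total >= self.presence_threshold and name not in params:
                alerts.append(f"missing: {name}")
            key = (etype, name)
            if name in params and key in self.ranges:
                lo, hi = self.ranges[key]
                v = params[name]
                if isinstance(v, (int, float)) and not (lo <= v <= hi):
                    alerts.append(f"out of bounds: {name}={v} (seen {lo}..{hi})")
        return alerts
```

Something along these lines, but with real statistics (distributions rather than min/max, confidence before alerting) is what I'm hoping already exists.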



Just to expand on this...

Whenever something in the data goes wrong (a missing or incorrect value, etc.), we usually notice far too late, after decisions have already been made based on the incorrect values. Frequently the mistake is introduced in another part of the company that isn't even aware of our needs. Other times, some vendor is timing out or having API issues and values are missing.

There are too many places where things can go wrong, but the only central place where any of this is visible is the data collected for each event.


If I understand what you are doing, you are creating a data record for an event, and you want to make a quality assessment automatically for each event?

For one thing, you need automated tests that tell you when an invariant is broken. You'll also need to be able to express some conditions manually: for instance, if the downstream people have a problem when "Y" happens and "Y" can be precisely specified, this is no problem.
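A minimal sketch of what I mean by manually expressed invariants (the event fields here are hypothetical):

```python
# Each invariant is a (description, predicate) pair; the predicates are
# written by whoever knows the business rule. Event shape is made up.
INVARIANTS = [
    ("amount must be non-negative",
     lambda e: e.get("amount", 0) >= 0),
    ("currency required when amount present",
     lambda e: "amount" not in e or "currency" in e),
]

def broken_invariants(event):
    """Return the descriptions of every invariant this event violates."""
    return [desc for desc, ok in INVARIANTS if not ok(event)]
```

The point is that precisely specifiable conditions like these are cheap to check on every event; the hard part is the conditions nobody has written down yet.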

Get in touch with me offline if you want to know more.


I think that is the general gist. What we can definitely say is that when data element X was missing from event A, it was a problem for system Y. We may even be able to create a feedback loop from the error logs that feeds training data back to the algo.
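That feedback loop could be as simple as scraping the downstream error logs into labels. A sketch, assuming a made-up log format (ours would obviously differ):

```python
import re

# Hypothetical log line: "ERROR system=Y event_id=42 missing=amount"
LOG_PATTERN = re.compile(r"ERROR system=(\w+) event_id=(\w+) missing=(\w+)")

def labels_from_log(lines):
    """Turn downstream error-log lines into {event_id: {bad_param, ...}}
    labels that could be fed back as training data."""
    labels = {}
    for line in lines:
        m = LOG_PATTERN.search(line)
        if m:
            _system, event_id, param = m.groups()
            labels.setdefault(event_id, set()).add(param)
    return labels
```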


The answer to that is to build something like a trouble-ticket system for errors in the log. You'll probably need to capture 10,000-30,000 judgements to construct useful machine learning models. In the meantime, a power tool for organizing the repair work will get the job done.
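Even before you have enough judgements for a real model, simple counting over the tickets tells you a lot. A sketch (the ticket shape is invented):

```python
from collections import Counter

def problem_rates(judgements):
    """judgements: (missing_param, was_real_problem) pairs captured from
    the ticket system. Returns, per parameter, the fraction of tickets a
    human judged to be a real problem -- a baseline before any ML model."""
    seen, bad = Counter(), Counter()
    for param, real in judgements:
        seen[param] += 1
        if real:
            bad[param] += 1
    return {p: bad[p] / seen[p] for p in seen}
```

Once those rates stabilize you know which alerts are worth a page and which are noise, and the same labels become the training set later.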



