Hacker News
Ask HN: Data quality monitoring?
2 points by canterburry on June 29, 2015 | 4 comments
We have a data quality problem.

We run a large analytics and real-time decisioning pipeline operating on thousands of different parameters per event. We source this data from many different sources, both inside and outside our company. Which parameters are required and which are optional varies by event type. The mix of parameters for each event changes over time, but not frequently.

Is there anything out there that can monitor the expected set of parameters for each event, including the expected range for each parameter, and alert when a parameter is either missing or out of bounds?

I do not want to define this manually, but would prefer something that uses machine learning and statistical inference.
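To make the ask concrete, here's a toy version of the kind of thing I'm imagining (all names made up, not a real product): learn, per event type, which parameters are almost always present and what numeric range each has been seen in, then flag deviations.

```python
from collections import defaultdict

class EventProfile:
    """Toy sketch: learn per-event-type parameter presence and numeric
    ranges from history, then flag missing or out-of-bounds parameters.
    Class and method names are hypothetical."""

    def __init__(self, presence_threshold=0.95):
        self.presence_threshold = presence_threshold
        self.counts = defaultdict(int)        # events seen per event type
        self.param_counts = defaultdict(int)  # (type, param) -> occurrences
        self.ranges = {}                      # (type, param) -> (min, max)

    def observe(self, event_type, params):
        """Feed one historical event into the profile."""
        self.counts[event_type] += 1
        for name, value in params.items():
            key = (event_type, name)
            self.param_counts[key] += 1
            if isinstance(value, (int, float)):
                lo, hi = self.ranges.get(key, (value, value))
                self.ranges[key] = (min(lo, value), max(hi, value))

    def check(self, event_type, params):
        """Return human-readable anomaly strings for a new event."""
        alerts = []
        total = self.counts.get(event_type, 0)
        if total == 0:
            return alerts  # never seen this event type; nothing learned yet
        for (etype, name), seen in self.param_counts.items():
            if etype != event_type:
                continue
            # A parameter present in >= threshold of past events is "required".
            if seen / total >= self.presence_threshold and name not in params:
                alerts.append(f"missing: {name}")
            key = (etype, name)
            if name in params and key in self.ranges:
                lo, hi = self.ranges[key]
                v = params[name]
                if isinstance(v, (int, float)) and not (lo <= v <= hi):
                    alerts.append(f"out of bounds: {name}={v} (seen {lo}..{hi})")
        return alerts
```

Something along these lines, but with real statistics (distributions rather than min/max, confidence before alerting) is what I'm hoping already exists.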



Just to expand on this...

Whenever something in the data goes wrong (a missing or incorrect value, etc.), we usually notice far too late, after decisions have already been made based on the incorrect values. Frequently the mistake is introduced in another part of the company that isn't even aware of our needs. Other times, some vendor is timing out or having API issues and values are missing.

There are too many places where things can go wrong, but the only central place where any of this is visible is the data collected for each event.


If I understand what you are doing, you are creating a data record for an event, and you want to make a quality assessment automatically for each event?

For one thing, you need automated tests that tell you when an invariant is broken. You'll also need to be able to express some conditions manually: for instance, if the downstream people have a problem when "Y" happens and "Y" can be precisely specified, this is no problem.
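A minimal sketch of what I mean by manually expressed invariants (the event fields here are hypothetical):

```python
# Each invariant is a (description, predicate) pair; the predicates are
# written by whoever knows the business rule. Event shape is made up.
INVARIANTS = [
    ("amount must be non-negative",
     lambda e: e.get("amount", 0) >= 0),
    ("currency required when amount present",
     lambda e: "amount" not in e or "currency" in e),
]

def broken_invariants(event):
    """Return the descriptions of every invariant this event violates."""
    return [desc for desc, ok in INVARIANTS if not ok(event)]
```

The point is that precisely specifiable conditions like these are cheap to check on every event; the hard part is the conditions nobody has written down yet.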

Get in touch with me offline if you want to know more.


I think that is the general gist. What we can definitely say is that when data element X was missing from event A, it was a problem for system Y. We may even be able to create a feedback loop from the error logs that feeds training data back to the algo.
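That feedback loop could be as simple as scraping the downstream error logs into labels. A sketch, assuming a made-up log format (ours would obviously differ):

```python
import re

# Hypothetical log line: "ERROR system=Y event_id=42 missing=amount"
LOG_PATTERN = re.compile(r"ERROR system=(\w+) event_id=(\w+) missing=(\w+)")

def labels_from_log(lines):
    """Turn downstream error-log lines into {event_id: {bad_param, ...}}
    labels that could be fed back as training data."""
    labels = {}
    for line in lines:
        m = LOG_PATTERN.search(line)
        if m:
            _system, event_id, param = m.groups()
            labels.setdefault(event_id, set()).add(param)
    return labels
```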


The answer to that is to build something like a trouble-ticket system for errors in the log. You'll probably need to capture 10,000-30,000 judgements to construct useful machine learning models. In the meantime, a power tool for organizing the repair work will get the job done.
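Even before you have enough judgements for a real model, simple counting over the tickets tells you a lot. A sketch (the ticket shape is invented):

```python
from collections import Counter

def problem_rates(judgements):
    """judgements: (missing_param, was_real_problem) pairs captured from
    the ticket system. Returns, per parameter, the fraction of tickets a
    human judged to be a real problem -- a baseline before any ML model."""
    seen, bad = Counter(), Counter()
    for param, real in judgements:
        seen[param] += 1
        if real:
            bad[param] += 1
    return {p: bad[p] / seen[p] for p in seen}
```

Once those rates stabilize you know which alerts are worth a page and which are noise, and the same labels become the training set later.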



