We have a data quality problem.
We run a large analytics and real-time decisioning pipeline operating on thousands of different parameters per event. We source this data from many different sources, both inside and outside of our company. Which parameters are required versus optional differs from event to event, and the mix of parameters for each event changes over time, though not frequently.
Is there anything out there that can monitor the expected set of parameters for each event, including the expected range for each parameter, and alert when a parameter is either missing or out of bounds?
I do not want to define this manually; I would prefer something that learns the expectations using machine learning or statistical methods.
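To make the ask concrete, here is a rough, hand-rolled sketch of the kind of check I imagine a tool doing automatically and at much larger scale (the function names `build_profile` and `check_event` and the thresholds are just my illustration, not any particular product): learn per-event presence rates and value ranges from history, then flag anything missing or outside the learned bounds.

```python
# Hypothetical sketch only: profile historical events per event type,
# then flag missing or out-of-range parameters in new events.
from collections import defaultdict
from statistics import mean, stdev

def build_profile(history):
    """history: list of (event_type, params_dict). Returns learned expectations."""
    values = defaultdict(list)   # (event_type, param) -> numeric values seen
    counts = defaultdict(int)    # (event_type, param) -> how often the param appeared
    totals = defaultdict(int)    # event_type -> number of events seen
    for event_type, params in history:
        totals[event_type] += 1
        for name, value in params.items():
            counts[(event_type, name)] += 1
            if isinstance(value, (int, float)):
                values[(event_type, name)].append(float(value))
    profile = {}
    for (event_type, name), n in counts.items():
        presence = n / totals[event_type]
        bounds = None
        vals = values[(event_type, name)]
        if len(vals) >= 2:
            m, s = mean(vals), stdev(vals)
            bounds = (m - 3 * s, m + 3 * s)   # crude "expected range"
        profile.setdefault(event_type, {})[name] = (presence, bounds)
    return profile

def check_event(profile, event_type, params, presence_threshold=0.95):
    """Return a list of anomaly messages for one incoming event."""
    alerts = []
    for name, (presence, bounds) in profile.get(event_type, {}).items():
        if presence >= presence_threshold and name not in params:
            alerts.append(f"{event_type}: expected parameter '{name}' is missing")
        elif bounds and name in params and isinstance(params[name], (int, float)):
            low, high = bounds
            if not (low <= float(params[name]) <= high):
                alerts.append(f"{event_type}: '{name}'={params[name]} outside learned range")
    return alerts

# Example: learn from a small history, then check an event with a missing
# field and an out-of-range value.
history = [("purchase", {"amount": a, "currency": "USD"}) for a in (9.5, 10.0, 10.5, 11.0)]
profile = build_profile(history)
print(check_event(profile, "purchase", {"amount": 250.0}))
```

Obviously this toy version would not cope with thousands of parameters, drifting schemas, or categorical values; I am looking for something that does this kind of profiling and alerting properly.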
Whenever something in the data goes wrong (a missing or incorrect value, etc.), we usually notice far too late, after decisions have already been made based on the bad values. Frequently the mistake is introduced in another part of the company that is not even aware of our needs. Other times a vendor is timing out or having API issues and values are missing.
There are too many places where things can go wrong, but the only central place where any of this is visible is the data collected for each event.