To all data scientists and machine learning researchers and practitioners:
- What do you do first when you get a new dataset for machine learning?
- How do you analyze your data to find relevant features?
- How do you identify data quality problems?
- Which statistical tests do you perform on the dataset?
- Which visualization techniques do you use to investigate the data?
I'm working on a library that helps people find potential problems in datasets and ML models (it isn't ready to share or publish yet), so I'd love to get your feedback on what you consider best practices for preparing and validating datasets for ML.