1. Your data is severely imbalanced, so accuracy is a very misleading metric here. From what I see, you have roughly a 1:20 ratio of malicious to non-malicious samples. That both skews the metrics and biases the classifier.
2. Adding to the other comment asking for calibration curves: look at your minority-class performance in terms of precision, recall, F-beta, and average precision (area under the precision-recall curve).
3. Then see whether resampling helps or hurts predictive performance; the answer typically speaks to the level of noise and small disjuncts in the data.
4. I see you've done an 80/20 train-test split, but try to eliminate split bias by using stratified cross-validation. That would ensure you didn't just get lucky with random seed = 42 and draw a really favorable test set.
All of these can be implemented using sklearn and imbalanced-learn [0]. Not included: a deeper dive into cost-sensitive and adversarial techniques. Let me know if you have any more questions, and keep up the good work!
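A minimal sketch of points 2-4 using scikit-learn only, on synthetic stand-in data (the real query features and model aren't shown in the post, so the dataset and classifier below are placeholders; imbalanced-learn's `RandomOverSampler` would do the resampling step in one call):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, fbeta_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils import resample

# Synthetic stand-in for the query dataset, at roughly the 1:20
# imbalance estimated above (swap in the real features and labels).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)

clf = LogisticRegression(max_iter=1000)

# Point 4: stratified 5-fold CV instead of a single 80/20 split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
ap_scores = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
print(f"avg precision across folds: "
      f"{ap_scores.mean():.3f} +/- {ap_scores.std():.3f}")

# Point 2: minority-class metrics on one held-out fold.
train_idx, test_idx = next(cv.split(X, y))
clf.fit(X[train_idx], y[train_idx])
pred = clf.predict(X[test_idx])
proba = clf.predict_proba(X[test_idx])[:, 1]
print("precision:", precision_score(y[test_idx], pred))
print("recall:   ", recall_score(y[test_idx], pred))
print("F2:       ", fbeta_score(y[test_idx], pred, beta=2))
print("avg prec: ", average_precision_score(y[test_idx], proba))

# Point 3: random oversampling of the minority class, applied to the
# training fold only (never the test fold, or scores are inflated).
X_maj = X[train_idx][y[train_idx] == 0]
X_min = X[train_idx][y[train_idx] == 1]
X_min_up = resample(X_min, n_samples=len(X_maj), random_state=0)
X_res = np.vstack([X_maj, X_min_up])
y_res = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
clf.fit(X_res, y_res)
print("recall after oversampling:",
      recall_score(y[test_idx], clf.predict(X[test_idx])))
```

Comparing the before/after recall and average precision tells you whether resampling is helping on your data or just overfitting duplicated minority points.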
Thank you so much for these suggestions. I'll definitely try them and let you know.
One thing to add: the data is not that imbalanced. I only used 100,000 non-malicious and 50,000 malicious queries, so it's actually 2:1. I didn't use all of the non-malicious queries.
A machine learning driven web application firewall
http://fsecurify.com/fwaf-machine-learning-driven-web-applic...