Engineering manager for Stripe Radar here. Today’s update has been almost a year in the making and we’re excited to help Stripe businesses fight fraud more effectively. Here's more on what's new: https://stripe.com/blog/radar-2018
I (and the entire Radar team) are on hand to answer any questions you may have!
One of the issues I faced during my short stint building ML models for fraud detection in debit card transactions was dealing with class imbalance. I was not completely convinced that oversampling or undersampling techniques would work. My initial experiments just resulted in more false positives. Just curious if you guys faced similar problems.
The other point I bring up is rather rhetorical: there are no open standards, model baselines, or datasets in the fraud domain. Compare building a model for fraud detection to building one for image recognition or object detection. There, you have standard datasets and a standard baseline, and your model competes against that baseline. Because of the open nature of image recognition, the models have improved astronomically. I feel that the lack of such openness in fraud is holding back innovation. I could be wrong in this assessment, so please correct me if so.
I agree that the lack of standards and baselines in the fraud detection space isn't ideal. One example: some fraud products will build models using human labels as the target to be predicted. Radar, on the other hand, tries to predict whether a charge actually turns out to be fraudulent (we use dispute/chargeback data we get directly from card issuers/networks). These are in fact different problems and the fact that the industry generally doesn't have a consistent target makes discourse and comparisons more muddled.
(And on class imbalance: we spent quite a bit of time experimenting/analyzing how to deal with it—we found that sampling rate has a marginal impact on performance but not a huge one.)
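To make the sampling-rate discussion concrete, here is a minimal, hypothetical sketch of undersampling the majority (non-fraud) class at different rates on synthetic data. The data, the 1% fraud rate, and the `undersample` helper are all illustrative assumptions, not Radar's actual pipeline:

```python
import random

random.seed(0)

# Synthetic, heavily imbalanced dataset: ~1% "fraud" (label 1).
data = [(random.random(), 1 if random.random() < 0.01 else 0)
        for _ in range(100_000)]

def undersample(rows, majority_rate):
    """Keep every minority (fraud) row; keep each majority row
    with probability `majority_rate`."""
    return [(x, y) for x, y in rows
            if y == 1 or random.random() < majority_rate]

for rate in (1.0, 0.1, 0.01):
    sampled = undersample(data, rate)
    frauds = sum(y for _, y in sampled)
    print(f"rate={rate:<4} size={len(sampled):>6} "
          f"fraud fraction={frauds / len(sampled):.3f}")
```

Lower rates make the training set more balanced at the cost of discarding legitimate-transaction examples; many practitioners compensate with class weights instead of (or alongside) sampling.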
The only problem I ran into with the old Radar was a situation where a card was declined, and the customer contacted his bank to clear it up. The bank said they had no idea; they hadn't declined the charge. When I followed up with Stripe, it turned out Stripe had declined the charge, and it never got to the bank.
Is there a way to tell when this happens in the Dashboard? There wasn't at the time, but I'm hoping this is maybe now visible somehow? It's obviously helpful to know when trying to help a customer resolve the situation.
PM on Radar here. There is! If you see a payment blocked for high risk in the Stripe Dashboard, it means that Radar blocked it before the card was charged. That is, the customer’s bank would have no record of the charge. In addition to the risk evaluation, Radar also provides the primary reason a transaction is believed to be high-risk (for example, the card has been linked to an unusually large number of card payments in the Stripe network over the past 24 hours).
A problem we've run into with Radar is that it only kicks in when you attempt to create a charge, and not when you attach a card to a customer.
This means that if your business model involves "try before you buy" or usage-based billing, you'd better be sure to make an initial charge, otherwise the customer might incur costs before Radar decides to block the charges.
Even if you do require an initial charge, if you allow customers to change their credit card between recurring charges, the new card could be extra risky and "fly under the Radar" until the first charge attempt.
Are there any plans to offer fraud risk evaluation and blocking when attaching a card to a customer, or will it still be limited to just blocking charges? With Stripe's new emphasis on recurring billing, it seems like this would be important.
We currently see Radar as a liability for us. It might block the occasional fraud and avoid a chargeback, but it also allows customers to incur costs with dodgy cards before we know they're dodgy, and then blocks charges outright before we know.
My perspective on this is colored by selling SaaS.
In software sold on a free-trial model, you assume most trials don’t convert (overwhelmingly due to declining to pay but with a bit of fraud) and then the cost to provision the service (COGS) is, effectively, a marketing expense. COGS in SaaS are typically negligible to low; this is why the industry is OK with providing services on, basically, a digital handshake. If you want to allow users to try out high-COGS services (or highly-abused services) prior to verifying capacity/willingness to pay, you’d need some way to credit score potential customers outside the context of a particular payment.
To date, we’ve generally focused the bulk of our ML efforts on things which apply to the majority of our users, but as we get better at customizing these technologies to specific industries at scale and even on a per-account basis, we could certainly imagine applying them in contexts that are more relevant to your model. I’d love to hear more detail about your use case; feel free to email me (my HN username at stripe.com). If we get closer to shipping something that would likely interest you, we’d be happy to give you a heads up.
Most of our ML stack has been developed internally given the unique constraints we have for Radar. Among other things, we need to be able to
- compute a huge number of features, many of which are quite complex (involving data collected from throughout the payment process), in real-time: e.g. how many distinct IP addresses have we seen this card from over its entire history on Stripe, how many distinct cards have we seen from the IP address over its history, and do payments from this card usually come from this IP address?
- train custom models for all Stripe users who have enough data to make this feasible, necessitating the ability to train large numbers of models in parallel,
- provide human-readable explanations as to why we think a payment has the score that it does (which involves building simpler “explanation models”—which are themselves machine learning models—on top of the core fraud models),
- surface model performance and history in the Radar dashboard,
- allow users to customize the risk score thresholds at which we action payments in Radar for Fraud teams,
- and so forth.
We found that getting everything exactly right on the data-ML-product interactions necessitated our building most of the stack ourselves.
That said, we do use a number of open source tools—we use TensorFlow and pytorch for our deep learning work, xgboost for training boosted trees, and Scalding and Hadoop for our core data processing, among others.
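The first bullet above describes real-time aggregate features keyed by card and IP. As a toy illustration only (Radar's actual infrastructure is not public), here is a sketch of how such aggregates could be maintained with in-memory counters; a production system would need a low-latency distributed store and time-windowed counts:

```python
from collections import defaultdict

class FeatureStore:
    """Toy in-memory aggregates for card/IP features (illustrative only)."""

    def __init__(self):
        self.ips_per_card = defaultdict(set)    # card -> distinct IPs seen
        self.cards_per_ip = defaultdict(set)    # IP -> distinct cards seen
        self.card_ip_counts = defaultdict(int)  # (card, IP) -> payment count
        self.card_counts = defaultdict(int)     # card -> total payment count

    def record(self, card, ip):
        self.ips_per_card[card].add(ip)
        self.cards_per_ip[ip].add(card)
        self.card_ip_counts[(card, ip)] += 1
        self.card_counts[card] += 1

    def features(self, card, ip):
        seen = self.card_counts[card]
        return {
            "distinct_ips_for_card": len(self.ips_per_card[card]),
            "distinct_cards_for_ip": len(self.cards_per_ip[ip]),
            # Fraction of this card's payments that came from this IP.
            "card_usual_ip_fraction": (
                self.card_ip_counts[(card, ip)] / seen if seen else 0.0),
        }

store = FeatureStore()
for card, ip in [("c1", "a"), ("c1", "a"), ("c1", "b"), ("c2", "a")]:
    store.record(card, ip)
print(store.features("c1", "a"))
```

The interesting engineering problems, alluded to in the comment, are doing this at scale, in real time, over a card's entire history.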
Broadly speaking, what approach do you use to "build simpler 'explanation models'" from the more complicated "core fraud models"? Do you learn the models separately over the training data, or does the more complicated model somehow influence the training of the simpler model?
Why are you so stubborn about IP address? It's not a holy grail! I've used a proxy for some years now, and many times when I want to buy something on a storefront “powered by Stripe,” my card is declined due to an “unknown error.” The moment I turn off my VPN, the transaction goes through. I expect this is a huge problem for Stripe or anyone basing fraud decisions heavily on IP. These days, if I find a cool product and see “powered by Stripe,” I simply end up buying the same product on Amazon for a similar price. Worst part: your clients don't even know!
I’m sorry that you had this experience. We vehemently agree that any one signal (such as IP address or use of a proxy) is a pretty poor predictor of fraud in isolation. We are trying to move the industry towards holistic evaluation rather than inflexible blacklists; not everyone behind a TOR exit node is a fraudster, for example.
While we can’t fix the previous experience you had, we’ve rebuilt almost every component of our fraud detection stack over the past year. We’ve added hundreds of new signals to improve accuracy, each payment is now scored using thousands of signals, and we retrain models every day.
We hope these improvements will help. We want our customers to be able to provide you services; that’s what keeps the lights on here. We’d be happy to look into what happened if you have specific websites in mind—feel free to shoot me a note at mlm@stripe.com.
The rough idea is that you look at all the decisions made by the fraud model (sample 1 is fraud, sample 2 is not fraud) and the world of possible "predicates" ("feature 1 > x1", "feature 1 > x2", ..., "feature 10000 > z1," etc.) and try to find a collection of explanations (which are conjunctions of these predicates) that have high precision and recall over the fraud model's predictions. For example, if "feature X > X0 and feature Y < Y0" is true for 20% of all payments the fraud model thinks are fraudulent, and 95% of all payments matching those conditions are predicted by the fraud model to be fraud, that's a good "explanation" in terms of its recall and precision.
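The numbers in the comment (20% recall, 95% precision) are measured against the fraud model's predictions, not against ground-truth labels. A minimal sketch of scoring one candidate explanation this way, with hypothetical features `x` and `y` and made-up data:

```python
# Each payment: (feature dict, model_says_fraud). All values are made up.
payments = [
    ({"x": 9, "y": 1}, True),
    ({"x": 8, "y": 2}, True),
    ({"x": 9, "y": 9}, False),
    ({"x": 1, "y": 1}, False),
    ({"x": 7, "y": 0}, True),
]

def explanation_quality(payments, predicate):
    """Precision/recall of a candidate explanation against the
    *model's* fraud predictions (not ground-truth fraud labels)."""
    matched = [fraud for feats, fraud in payments if predicate(feats)]
    flagged = sum(fraud for _, fraud in payments)
    if not matched or not flagged:
        return 0.0, 0.0
    precision = sum(matched) / len(matched)  # matched payments the model flags
    recall = sum(matched) / flagged          # flagged payments the predicate covers
    return precision, recall

# Candidate explanation: a conjunction of threshold predicates.
print(explanation_quality(payments, lambda f: f["x"] > 5 and f["y"] < 3))
# -> (1.0, 1.0): it matches exactly the payments the model flags
```

Searching the space of such conjunctions for high-precision, high-recall candidates is then an optimization problem in its own right.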
It's a little tough to talk about this in an HN comment but please feel free to shoot me an e-mail (mlm@stripe.com) if you'd like to talk more.
We’ve been working on Radar, and releasing updates and improvements, continuously since we first launched. What we’re announcing today is (1) a completely revamped machine learning system (which we couldn’t release in pieces—we needed to finish every layer of the stack before we could launch it, though we’ve been running it in beta for a percentage of users since late last year) and (2) a new package of features specifically designed for teams working on fraud prevention.