Presentation: Counterfactual Evaluation of Machine Learning Models

Track: ML in Action

Location: Cyril Magnin III

Duration: 11:00am - 11:50am

Day of week: Tuesday

Share this on:


Stripe processes billions of dollars in payments a year and uses machine learning to detect and stop fraudulent transactions. Like models used for ad and search ranking, Stripe's models don't just score—they dictate actions that directly change outcomes. High-scoring transactions are blocked before they can ever get refunded or disputed by the card holder. Deploying an initial model that successfully blocks a substantial amount of fraud is a great first step, but since your model is altering outcomes, subsequent parts of the modeling process become more difficult:

  • How do you evaluate the model? You can't observe the eventual outcomes of the transactions you block (would they have been refunded or disputed?) or the ads you didn't show (would they have been clicked?) In general, how do you quantify the difference between the world with the model and the world without it?
  • How do you train new models? If your current model is blocking a lot of transactions, you have substantially fewer samples of fraud for your new training set. Furthermore, if your current model detects and blocks some types of fraud more than others, any new model you train will be biased towards detecting that residual fraud. Ideally, new models would be trained on the "unconditional" distribution that exists in the absence of the original model.

In this talk, I'll describe how injecting a small amount of randomness in the production scoring environment allows you to answer these questions. We'll see how to obtain estimates of precision and recall (standard measures of model performance) from production data and how to approximate the distribution of samples that would exist in a world without the original model so that new models can be trained soundly.

Speaker: Michael Manapat

Head of Conversion Products @Stripe

Michael Manapat manages Stripe’s Conversion Products group. He was previously a Software Engineer at Google, a postdoctoral fellow in applied mathematics at Harvard, and graduate student in mathematics at MIT.

Find Michael Manapat at


  • Deep Learning Applications & Practices

    Deep learning lessons using tooling such as Tensorflow & PyTorch, across domains like large-scale cloud-native apps and fintech, and tacking concerns around interpretability of ML models.

  • Predictive Data Pipelines & Architectures

    Best practices for building real-world data pipelines doing interesting things like predictions, recommender systems, fraud prevention, ranking systems, and more.

  • ML in Action

    Applied track demonstrating how to train, score, and handle common machine learning use cases, including heavy concentration in the space of security and fraud

  • Real-world Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.