
Presentation: Instrumentation, Observability & Monitoring of Machine Learning Models

Track: Predictive Architectures in the Real World

Location: Cyril Magnin I

Time: 1:20pm - 2:00pm




What You’ll Learn

  1. Hear about the gap between ML researchers and software engineers, why ML is harder than it seems, and how to bridge the gap.

  2. Learn about what it takes to put an ML model into production.

  3. Find out some of the tips and tools to start with when approaching ML.


Production machine learning involves intentionally deploying and running some of the ugliest, hardest-to-debug spaghetti code you have ever seen (i.e., code that was generated by a computer) in the critical path of your operational environment. Because so much machine learning code has an academic origin, and because most experienced practitioners have primarily worked in offline, batch-oriented computing environments, there is often an impedance mismatch between DevOps and machine learning practitioners that causes unnecessary pain for everyone involved. In this talk, we're going to go deep into the monitoring and visibility needs of machine learning models in order to bridge these gaps and make everyone's working life a bit simpler, more pleasant, and more productive.


What's the focus of the work that you do?



I'm working on our services infrastructure after spending a year working on our search infrastructure. To improve our search ranking algorithms, we created what's called a mixer service, which allows us to combine and re-rank results from multiple search backends using custom machine learning models. In the process of deploying the mixer service in production, I started getting interested in building services in general: Kubernetes, Prometheus, all that good stuff. So that takes up a good fraction of my time these days.


Your talk is about instrumentation, observability and monitoring of machine learning models. How are you going to attack this subject?



Deploying machine learning models is like intentionally deploying spaghetti code into your production environment. A model is the worst, hardest-to-interpret code that you have ever written. Figuring out what is going on when a model starts misbehaving, throwing errors, or giving you back nonsense predictions is especially challenging, and I think that is largely unappreciated by the academic machine learning community. I want to talk about the pain I've experienced with this and how it motivated me to get very serious about monitoring and observability.


What are some of the tips that you're going to talk about for helping people to be able to trace or observe a running ML model?



I will cover the basics of modern monitoring and observability and tools like Honeycomb and Prometheus, which I consider essential prerequisites for production machine learning.
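To make the idea concrete, here is a minimal, library-free sketch of the kind of metrics you would put around a model in production: prediction counts, error counts, and the distribution of output scores (drift in that distribution is often the first warning that something upstream has changed). In a real service these would be Prometheus counters and histograms exposed on a `/metrics` endpoint; the model and metric names below are hypothetical, not from the talk.

```python
import random
from collections import Counter

# Illustrative metric store; in production these would be Prometheus
# counters/histograms scraped from a /metrics endpoint.
metrics = {"predictions": 0, "errors": 0}
score_buckets = Counter()  # histogram of output scores, bucket width 0.1

def predict(features):
    # Stand-in for a real model: returns a score in [0, 1].
    return random.random()

def serve_prediction(features):
    try:
        score = predict(features)
    except Exception:
        metrics["errors"] += 1  # count failures so alerting can fire
        raise
    metrics["predictions"] += 1
    # Bucket the score; a shifting histogram signals model/data drift.
    score_buckets[min(int(score * 10), 9)] += 1
    return score

random.seed(0)
for _ in range(1000):
    serve_prediction({})

print(metrics["predictions"])       # 1000
print(sum(score_buckets.values()))  # 1000
```

The point is that the model is instrumented at its serving boundary, not inside the model code itself, so the same dashboards and alerts you already use for ordinary services apply.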


What do you want someone who comes to your talk to leave with?



I think that there is a great deal of fear around putting machine learning into production, which is healthy. For a decade, I underestimated how hard this was to do, because I always had other people doing much of the hard work for me. I want to bring some of what I’ve had to learn in the last year back to the machine learning community in a way that fosters better working relationships between the ML researchers and the software engineering community.



Is it correct to say that this talk is focused on production readiness of a machine learning model?


Yes, exactly. What does your production environment need to look like before you start doing machine learning for stuff that matters?

Speaker: Josh Wills

Software Engineer, Search, Learning, and Intelligence @SlackHQ

