You are viewing content from a past/completed QCon

Track: Predictive Data Pipelines & Architectures

Location: Cyril Magnin I

Day of week: Tuesday

Predictive data pipelines have become essential to building engaging experiences on the web today. Whether you enjoy personalized news feeds on LinkedIn and Facebook, profit from near-realtime updates to search engines and recommender systems, or benefit from near-realtime fraud detection on a lost or stolen credit card, you have come to rely on the fruits of predictive data pipelines as an end user. As an ops-focused engineer, you may employ these pipelines to understand complex call trees in your microservice-based infrastructure with the aim of eliminating redundant system load or improving mobile and web application performance. Come to this track to learn about interesting applications of predictive systems and the fundamentals that underlie them.

Track Host: Sid Anand

Chief Data Engineer @PayPal

Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their two kids.

SHORT TALK (10 MIN)

10:40am - 10:50am

Transmogrification: The Magic of Feature Engineering

Leah McGuire, Principal Member of Technical Staff @Salesforce
Mayukh Bhaowal, Director of Product Management @Salesforce

CASE STUDY TALK (50 MIN)

11:00am - 11:50am

The Black Swan of Perfectly Interpretable Models

Machine Learning (ML) software differs from traditional software in that its outcomes are not based on a set of hand-coded rules and hence are not easily predictable. The behavior of such software changes over time based on data and feedback loops. At Salesforce Einstein, we care deeply about building trust and confidence in such intelligent software programs. Why does a particular email have a higher likelihood of being opened than another? What are the shapes and patterns in the dataset that lead to certain predictions? And can such insights be actionable?

As machine learning pervades every software vertical, and is increasingly used to automate decisions, model interpretability becomes an integral part of the ML pipeline, and can no longer be an afterthought. In the real world, the demand for being able to explain a model is rapidly gaining on model accuracy and other model evaluation metrics.

This talk will discuss the steps taken at Salesforce Einstein towards making machine learning transparent and less of a black box. We will explain how interpretability fits into the ML data pipeline, what we learned from trying different approaches, and how it has helped drive wider adoption of ML software.

Leah McGuire, Principal Member of Technical Staff @Salesforce
Mayukh Bhaowal, Director of Product Management @Salesforce

SHORT TALK (10 MIN)

2:25pm - 2:35pm

Building (Better) Data Pipelines with Apache Airflow

Sid Anand, Chief Data Engineer @PayPal

CASE STUDY TALK (50 MIN)

2:45pm - 3:35pm

Data Pipelines for Real-Time Fraud Prevention at Scale

PayPal processes about a billion dollars of payment volume daily ($354bn in FY2016); complex decisions are made for each transaction or user action to manage risk and compliance while also ensuring a good user experience. PayPal users can make payments immediately in 200 countries with the assurance that the company’s transactions are secure.

How does PayPal achieve this goal in today's complex environment filled with "high-level" fraudsters as well as constantly increasing customer demand? While many industry solutions rely on fast analytics performed in near-real time over streaming data, our business requirements demand real-time, millisecond-range response.

This talk will address the architectural approach behind our internally built real-time service platform, which delivers unparalleled performance and quality of decisions. This platform blurs the line between Big Data and sustainable support for a high volume of real-time decision requests. A well-structured design, along with a domain modeling methodology, provides high adaptability to emerging fraud patterns and behavioral variations. Deployment on a real-time, event-driven, fast-data in-memory architecture accelerates detection and decisions, thereby reducing losses, improving customer experience, and allowing efficient new integrations.

Mikhail Kourjanski, Lead Data Architect @PayPal

SHORT TALK (10 MIN)

4:00pm - 4:10pm

pDB: Abstraction for Modeling Predictive Machine Learning Problems

Balaji Rengarajan, Senior Data Scientist @Celect

CASE STUDY TALK (50 MIN)

4:20pm - 5:10pm

pDB: Scalable Prediction Infrastructure With Precision and Provenance

We describe an extensible, cloud-independent data science platform based on Celect’s pDB framework for non-parametric machine learning. The pDB framework provides a common abstraction for almost all machine learning problems of interest, including classification, personalization, time series prediction, and linear and nonlinear regression. We developed an extensible and flexible data platform around the core pDB framework. This platform was born out of our need to provide scalable and flexible predictive analytics solutions for retailers and the federal government.

In this talk, I will describe the pDB formalism associated with the platform and the architectural aspects of data import/ETL, data transformation, compute and query architecture, cross-validation, cluster management, pipeline definition, and workflow orchestration. I will illustrate the use of the platform through multiple use cases such as online personalization, document classification, and geospatial anomaly detection.

Balaji Rengarajan, Senior Data Scientist @Celect

On the topic of

Data Pipeline Practices

SHORT TALK (10 MIN)

12:50pm - 1:00pm

Two Effective Algorithms for Time Series Forecasting

Danny Yuan, Real-time Streaming Lead @Uber

CASE STUDY TALK (50 MIN)

1:10pm - 2:00pm

Machine Learning Pipeline for Real-time Forecasting @Uber Marketplace

Uber's Marketplace is the algorithmic brain behind Uber's ride-sharing services. To help Marketplace systems make proactive and efficient decisions, the Marketplace Forecasting team builds and operates multiple machine learning models to produce forecasts of many metrics, including supply and demand, over both granular time and a large number of geospatial dimensions.

To empower both data scientists and engineers to build and manage models that range from regressions to neural networks in production, the Marketplace Forecasting team has built a highly scalable and automated machine learning platform that supports efficient feature engineering, distributed model training, turn-key model deployment, metric-based automatic model selection, and scalable model serving.

This talk will discuss how deep learning helps improve the accuracy and efficiency of our forecasting models, the architecture of the machine learning platform, how it evolved from a simple ad-hoc system, and lessons learned in running the platform in production.

Danny Yuan, Real-time Streaming Lead @Uber
Chong Sun, Senior Software Engineer @Uber

Tracks

  • Grokking Timeseries & Sequential Data

    Techniques, practices, and approaches, including image recognition, NLP, predictions, & modeling.

  • Deep Learning in Practice

    Deep learning lessons using TensorFlow, Keras, PyTorch, and Caffe, including use cases on machine translation, computer vision, & image recognition.

  • AI Meets the Physical World

    Where AI touches the physical world: think drones, ROS, NVIDIA, TPU, and more.

  • Papers to Production: CS in the Real World

    Groundbreaking papers make real world impact.

  • Solving Software Engineering Problems with Machine Learning

    Anomaly detection, ML in IDEs, Bayesian optimization for config. Machine Learning techniques for more effective software engineering.

  • Predictive Architectures in the Real World

    Case-study-focused look at end-to-end predictive pipelines from places like Salesforce, Uber, LinkedIn, & Netflix.