You are viewing content from a past/completed QCon

Track: Predictive Data Pipelines & Architectures

Location: Cyril Magnin I

Day of week: Tuesday

Predictive data pipelines have become essential to building engaging experiences on the web today. Whether you enjoy personalized news feeds on LinkedIn and Facebook, profit from near-realtime updates to search engines and recommender systems, or benefit from near-realtime fraud detection on a lost or stolen credit card, you have come to rely on the fruits of predictive data pipelines as an end user. As an ops-focused engineer, you may employ these pipelines to understand complex call trees in your microservice-based infrastructure with the aim of eliminating redundant system load or improving mobile and web application performance. Come to this track to learn about interesting applications of predictive systems and the fundamentals that underlie them.

Track Host: Sid Anand

Chief Data Engineer @PayPal

Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions including Agari's Data Architect, a Technical Lead in Search @ LinkedIn, Netflix’s Cloud Data Architect, Etsy’s VP of Engineering, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their 2 kids.

10:40am - 10:50am

Transmogrification: The Magic of Feature Engineering

Machine learning algorithms often take center stage in machine learning and AI. However, in the real world, 90% of the time spent building models goes into creating the mythical perfect numeric matrix of features, to feed into the chosen algorithm. Every machine learning team repeats the same effort, reinventing the wheel once again.

In this session, you'll learn about transmogrification, where we magically and automatically engineer features based on the type of feature, data distribution and association with the response variable.
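As a rough illustration of engineering features "based on the type of feature", the sketch below picks an encoding from a column's Python type: numeric columns are standardized and everything else is one-hot encoded. The function name `vectorize` and the two rules are assumptions made for this example; this is not the Salesforce implementation, which the abstract says also considers data distribution and association with the response variable.

```python
def vectorize(column):
    """Encode one column as numeric feature rows, choosing the rule by value type.

    Hypothetical helper for illustration only.
    """
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )
    if is_numeric:
        # Numeric column: standardize to zero mean and unit variance.
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / len(column)
        std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant columns
        return [[(v - mean) / std] for v in column]
    # Any other column: treat values as categories and one-hot encode them.
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]


numeric_features = vectorize([1, 3])
categorical_features = vectorize(["a", "b", "a"])
```

Dispatching on type keeps the caller free of per-column decisions, which is the kind of repeated manual effort the abstract describes.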

Leah McGuire, Principal Member of Technical Staff @Salesforce
Mayukh Bhaowal, Director of Product Management @Salesforce

11:00am - 11:50am

The Black Swan of Perfectly Interpretable Models

Machine Learning (ML) software differs from traditional software in the sense that outcomes are not based on a set of hand-coded rules and hence are not easily predictable. The behavior of such software changes over time based on data and feedback loops. At Salesforce Einstein, we care deeply about building trust and confidence in such intelligent software programs. Why does a particular email have a higher likelihood of being opened than another? What are the shapes and patterns in the dataset that lead to certain predictions? And can such insights be actionable?

As machine learning pervades every software vertical, and is increasingly used to automate decisions, model interpretability becomes an integral part of the ML pipeline, and can no longer be an afterthought. In the real world, the demand for being able to explain a model is rapidly gaining on model accuracy and other model evaluation metrics.

This talk will discuss the steps taken at Salesforce Einstein towards making machine learning transparent and less of a black box. We will explain how interpretability fits into the ML data pipeline, what we learned trying different approaches and how it has helped drive wider adoption of ML software.

Leah McGuire, Principal Member of Technical Staff @Salesforce
Mayukh Bhaowal, Director of Product Management @Salesforce

2:25pm - 2:35pm

Building (Better) Data Pipelines with Apache Airflow

Apache Airflow is an up-and-coming platform to programmatically author, schedule, manage, and monitor workflows. Central to Airflow’s design is that it requires users to define DAGs (directed acyclic graphs), a.k.a. workflows, in Python code, so that DAGs can be managed via the same software engineering principles and practices used to manage any other code.

With more than 7600 GitHub stars, 2400 forks, 430 contributors, 150 companies officially using it, and 4600 commits, it is quickly gaining traction among data science, ETL engineering, data engineering, and devops communities at large. What makes Apache Airflow so popular? Come to this talk to get a whirlwind intro based on a real-world predictive data pipeline example.
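To make the idea of a workflow as a DAG in Python code concrete, here is a minimal sketch that uses only the standard library's `graphlib` module (Python 3.9+) rather than Airflow itself. The task names and the `execution_order` helper are hypothetical; Airflow's own API for declaring DAGs and operators is different and is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "publish_scores": {"train_model"},
}


def execution_order(deps):
    """Return a run order in which every task comes after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which is why the workflow must be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Because the workflow is ordinary Python data and code, it can be versioned, reviewed, and tested like any other source file, which is the design point the abstract makes.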

Sid Anand, Chief Data Engineer @PayPal

2:45pm - 3:35pm

Data Pipelines for Real-Time Fraud Prevention at Scale

PayPal processes about a billion dollars of payment volume daily ($354bn in FY2016); complex decisions are made for each transaction or user action to manage risk and compliance, while also ensuring a good user experience. PayPal users can make payments immediately in 200 countries with the assurance that the company’s transactions are secure.

How does PayPal achieve this goal in today's complex environment filled with "high-level" fraudsters as well as constantly increasing customer demand? While many industry solutions rely on fast analytics performed in near-real time over streaming data, our business requirements demand real-time, millisecond-range response.

This talk will address the architectural approach towards our internally built real-time service platform, which delivers unparalleled performance and quality of decisions. This platform blurs the line between Big Data and sustainable support for a high volume of real-time decision requests. A well-structured design, along with a domain modeling methodology, provides high adaptability to emerging fraud patterns and behavioral variations. Deployment on a real-time, event-driven, in-memory fast-data architecture accelerates detection and decisions, thereby reducing losses, improving customer experience, and allowing efficient new integrations.

Mikhail Kourjanski, Lead Data Architect @PayPal

4:00pm - 4:10pm

pDB: Abstraction for Modeling Predictive Machine Learning Problems

In this talk, we will give a brief overview of modeling machine learning problems using Celect’s pDB framework. This forms the basis of the enterprise-grade prediction analytics platform for retail and federal intelligence that we will describe in the longer talk. We will demonstrate how disparate predictive problems can be expressed using a common pDB language.

Balaji Rengarajan, Senior Data Scientist @Celect

4:20pm - 5:10pm

pDB: Scalable Prediction Infrastructure With Precision and Provenance

We describe an extensible, cloud-independent data science platform based on Celect’s pDB framework for non-parametric machine learning. The pDB framework provides a common abstraction for almost all machine learning problems of interest, including classification, personalization, time series prediction, and linear and nonlinear regression. We developed an extensible and flexible data platform around the core pDB framework. This platform was born out of our need to provide scalable and flexible predictive analytics solutions for retailers and the federal government.

In this talk, I will describe the pDB formalism associated with the platform, architectural aspects for data import/ETL, data transformation, compute and query architecture, cross-validation, cluster management, pipeline definition and workflow orchestration. We will illustrate the use of the platform through multiple use cases such as online personalization, document classification, and geospatial anomaly detection.

Balaji Rengarajan, Senior Data Scientist @Celect
On the topic of

Data Pipeline Practices

12:50pm - 1:00pm

Two Effective Algorithms for Time Series Forecasting

In this 10-minute talk, we will intuitively explain the fast Fourier transform and recurrent neural networks, two key tools that will be discussed in the later talk. We will also explore how these concepts play critical roles in time series forecasting. The audience will learn what the tools are, the key concepts associated with them, and why they are useful in time series forecasting.
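One way to see why a Fourier transform helps with time series is that it exposes the dominant seasonal period of a signal. The sketch below is an assumption-laden illustration, not material from the talk: it uses a naive O(n²) discrete Fourier transform in pure Python (an FFT computes the same values faster), and the helper names and the synthetic hourly series are made up for the example.

```python
import cmath
import math


def dft_magnitudes(series):
    """Magnitudes of the discrete Fourier transform for frequencies 0..n/2."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(series)))
        for k in range(n // 2 + 1)
    ]


def dominant_period(series):
    """Length, in samples, of the strongest cycle in the series."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]  # remove the constant level first
    magnitudes = dft_magnitudes(centered)
    # Skip k = 0, which only carries the (already removed) mean.
    strongest = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return len(series) / strongest


# Synthetic data: four days of hourly values with a 24-hour cycle.
hourly = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(hourly)
```

A forecaster could feed a detected period like this into a seasonal model as a feature; that step is left out here.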

Danny Yuan, Real-time Streaming Lead @Uber

1:10pm - 2:00pm

Machine Learning Pipeline for Real-time Forecasting @Uber Marketplace

Uber's Marketplace is the algorithmic brain behind Uber's ride-sharing services. To help Marketplace systems make proactive and efficient decisions, the Marketplace Forecasting team builds and operates multiple machine learning models to produce forecasts of many metrics, including supply and demand, over both granular time and a large number of geo-spatial dimensions.

To empower both data scientists and engineers to build and manage models that range from regressions to neural networks in production, the Marketplace Forecasting team has built a highly scalable and automated machine learning platform that supports efficient feature engineering, distributed model training, turn-key model deployment, metric-based automatic model selection, and scalable model serving.

This talk will discuss how deep learning helps improve the accuracy and efficiency of our forecasting models, the architecture of the machine learning platform, how it evolved from a simple ad-hoc system, and lessons learned in running the platform in production.

Danny Yuan, Real-time Streaming Lead @Uber
Chong Sun, Senior Software Engineer @Uber
