Real-world Data Engineering

Location: Cyril Magnin III

Day of week: Wednesday

Showcasing DataEng tech and highlighting the strengths of each in real-world applications.

Track Host:
Wes Reisz
Software/Technical Advisor C4Media & QCon Chair, previous Architect @HPE

Wes Reisz joined QCon in 2015 and leads QCon Editorial as the conference chair. Wes focuses his energies on providing a platform for practicing engineers to tell their war stories so innovative/early adopter stage engineers can learn, adopt, and, in many cases, challenge each other. Before joining the QCon Team, Wes held a variety of enterprise architecture and software development roles with HP. His focus with HP was around developing/federating identity, integration/development of Java stack applications, architecting portal/CM solutions, and delivering on mobility in places like US Army’s Human Resources Command (HRC), Army Recruiting Command, and Army Cadet Support Program. In 2002, Wes began teaching as an adjunct faculty member at the University of Louisville. He continues to teach 400-level web architecture and mobile development courses to undergraduates. He is currently teaching Mobile Application Development with Android.


9:00am - 9:10am

Gimel Up and Running

Romit Mehta, Product Manager, Data Platforms @PayPal

9:20am - 10:10am

Gimel: PayPal’s Analytics Data Platform

At PayPal, data engineers, analysts, and data scientists work with a variety of data sources (Messaging, NoSQL, RDBMS, Documents, TSDB), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL), and execution models (stream, batch, interactive).

Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc., which impacts time-to-market (TTM). To solve this problem and make product development more effective, PayPal Data Platform developed "Gimel", a unified analytics data platform that provides access to any storage through a single unified data API and SQL, both powered by a centralized data catalog.

In this session, we will introduce you to the various components of Gimel - Compute Platform, Data API, PCatalog, GSQL and Notebooks. We will provide a demo depicting how Gimel reduces TTM by helping our engineers write a single line of code to access any storage without knowing the complexity behind the scenes.

Deepak Chandramouli, Analytics Tech Lead @PayPal

10:35am - 10:45am

A Whirlwind Overview of Apache Beam

Eugene Kirpichov, Cloud Dataflow Staff SE @Google

10:55am - 11:45am

Simplifying ML Workflows With Apache Beam

Come learn how Apache Beam is simplifying pre- and post-processing for ML pipelines. Apache Beam provides a portability layer that allows Beam pipelines to be written once and executed on any supported runtime. 2018 will be the year in which the Beam community completes the portability vision laid out when the project was founded, with full cross-language portability and robust open source runner support for Apache Flink and Spark.

Come see where we are in that journey, and learn how Beam is being integrated into the world of AI.

Tyler Akidau, Founder/Committer on Apache Beam & Engineer @Google

12:45pm - 12:55pm

Real-world Data Engineering Presentation


1:05pm - 1:55pm

When We Spark and When We Don’t: Developing Data and ML Pipelines at Stitch Fix

The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities to power nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.

This talk will cover thoughts and guidelines on how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss how we decide which portions of a pipeline run on which platform (e.g., what is important to run distributed across Spark clusters versus in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of tools and abstractions that have been developed at Stitch Fix to support the process from development, to deployment, to monitoring in production.

Jeff Magnusson, VP Data Platform @StitchFix

2:20pm - 2:30pm

Real-world Data Engineering Presentation


2:40pm - 3:30pm

Real-world Data Engineering Detailed Case Study

Each talk at QCon is hand-selected by our track hosts. We are currently discussing potential speakers for this track.

