Real-world Data Engineering

Location: Cyril Magnin III

Day of week: Wednesday

Showcasing DataEng tech and highlighting the strengths of each in real-world applications.

Track Host:
Wes Reisz
Software/Technical Advisor C4Media & QCon Chair, previous Architect @HPE

Wes Reisz joined QCon in 2015 and leads QCon Editorial as the conference chair. Wes focuses his energies on providing a platform for practicing engineers to tell their war stories so innovative/early adopter stage engineers can learn, adopt, and, in many cases, challenge each other. Before joining the QCon Team, Wes held a variety of enterprise architecture and software development roles with HP. His focus with HP was around developing/federating identity, integration/development of Java stack applications, architecting portal/CM solutions, and delivering on mobility in places like US Army’s Human Resources Command (HRC), Army Recruiting Command, and Army Cadet Support Program. In 2002, Wes began teaching as an adjunct faculty member at the University of Louisville. He continues to teach 400-level web architecture and mobile development courses to undergraduates. He is currently teaching Mobile Application Development with Android.


9:00am - 9:10am

Gimel: Commoditizing Data Access

Romit Mehta, Product Manager, Data Platforms @PayPal

9:20am - 10:10am

Gimel: PayPal’s Analytics Data Platform

At PayPal, data engineers, analysts and data scientists work with a variety of data sources (Messaging, NoSQL, RDBMS, Documents, TSDB), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL) and execution models (stream, batch, interactive).

Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc., which lengthens time-to-market (TTM). To solve this problem and to make product development more effective, PayPal Data Platform developed "Gimel", a unified analytics data platform that provides access to any storage through a single unified data API and SQL, both powered by a centralized data catalog.

In this session, we will introduce you to the various components of Gimel - Compute Platform, Data API, PCatalog, GSQL and Notebooks. We will provide a demo depicting how Gimel reduces TTM by helping our engineers write a single line of code to access any storage without knowing the complexity behind the scenes.
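The idea of a single read call backed by a central catalog can be illustrated with a minimal Python sketch. This is not Gimel's actual API: the names `CATALOG`, `READERS`, `FAKE_STORES`, and `read`, along with the dataset names and sample rows, are hypothetical stand-ins chosen for illustration. The catalog maps a logical dataset name to a storage type and location, and one function dispatches to the matching reader.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical catalog: logical dataset name -> storage type and physical location.
CATALOG: Dict[str, Dict[str, str]] = {
    "sales.orders": {"storage": "rdbms", "location": "orders_table"},
    "events.clicks": {"storage": "messaging", "location": "clicks_topic"},
}

# In-memory stand-ins for real storage systems (assumption for the sketch).
FAKE_STORES: Dict[str, List[Row]] = {
    "orders_table": [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 40.0}],
    "clicks_topic": [{"user": "a", "page": "/home"}],
}


def read_rdbms(location: str) -> List[Row]:
    # A real reader would open a database connection; here we copy sample rows.
    return list(FAKE_STORES[location])


def read_messaging(location: str) -> List[Row]:
    # A real reader would consume from a topic; here we copy sample rows.
    return list(FAKE_STORES[location])


# Registry of storage-specific readers, keyed by storage type.
READERS: Dict[str, Callable[[str], List[Row]]] = {
    "rdbms": read_rdbms,
    "messaging": read_messaging,
}


def read(dataset_name: str) -> List[Row]:
    """Resolve the dataset in the catalog and dispatch to the right reader."""
    entry = CATALOG[dataset_name]
    return READERS[entry["storage"]](entry["location"])


# The caller writes one line per dataset and never names the storage system.
orders = read("sales.orders")
clicks = read("events.clicks")
```

In this sketch, supporting a new storage system means registering one reader and adding catalog entries; calling code stays unchanged.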

Deepak Chandramouli, Analytics Tech Lead @PayPal

10:35am - 10:45am

A Whirlwind Overview of Apache Beam

Eugene Kirpichov, Cloud Dataflow Staff SE @Google

10:55am - 11:45am

Simplifying ML Workflows With Apache Beam

Come learn how Apache Beam is simplifying pre- and post-processing for ML pipelines. Apache Beam provides a portability layer that allows Beam pipelines to be written once and executed on any supported runtime. 2018 will be the year in which the Beam community completes the portability vision laid out when the project was founded, with full cross-language portability and robust open source runner support for Apache Flink and Spark.

Come see where we are in that journey, and learn how Beam is being integrated into the world of AI.

Tyler Akidau, Founder/Committer on Apache Beam & Engineer @Google

12:45pm - 12:55pm

Optimizing Spark

Greg Novak, Data Scientist @StitchFix

1:05pm - 1:55pm

When We Spark and When We Don’t: Developing Data and ML Pipelines

The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities to power nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.

This talk will cover thoughts and guidelines on how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss how we decide which portions of a pipeline run on which platform (e.g., what is important to run distributed across Spark clusters versus in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of the tools and abstractions developed at Stitch Fix to support the process from development, to deployment, to monitoring in production.

Jeff Magnusson, VP Data Platform @StitchFix

2:20pm - 2:30pm

(Past), Present, and Future of Apache Flink

Aljoscha Krettek, Co-Founder @dataArtisans

2:40pm - 3:30pm

Streaming SQL to Unify Batch & Stream Processing W/ Apache Flink @Uber

SQL is the lingua franca for querying and processing data. To this day, it provides nonprogrammers with a powerful tool for analyzing and manipulating data. But with the emergence of stream processing as a core technology for data infrastructures, can you still use SQL and bring real-time data analysis to a broader audience?

The answer is yes, you can. SQL fits into the streaming world very well and forms an intuitive and powerful abstraction for streaming analytics. More importantly, you can use SQL as an abstraction to unify batch and streaming data processing. Viewing streams as dynamic tables, you can obtain consistent results from SQL evaluated over static tables and streams alike and use SQL to build materialized views as a data integration tool.
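The "streams as dynamic tables" idea can be illustrated with a small Python sketch. It is not Flink's implementation: the `EVENTS` data, the `rides` table, and the functions `batch_totals` and `streaming_totals` are hypothetical names invented here. The batch version runs a `GROUP BY` over a static table in SQLite; the streaming version keeps an incrementally updated result that changes as each event arrives. Once the whole stream has been consumed, the two results agree.

```python
import sqlite3
from collections import defaultdict

# Hypothetical ride events as (rider, fare) pairs, in arrival order.
EVENTS = [("alice", 3), ("bob", 5), ("alice", 4), ("bob", 1), ("carol", 2)]


def batch_totals(events):
    """Evaluate SUM(fare) GROUP BY rider over a static table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides (rider TEXT, fare INTEGER)")
    conn.executemany("INSERT INTO rides VALUES (?, ?)", events)
    rows = conn.execute(
        "SELECT rider, SUM(fare) FROM rides GROUP BY rider"
    ).fetchall()
    conn.close()
    return dict(rows)


def streaming_totals(events):
    """Maintain the same aggregate incrementally, one event at a time.

    Each yielded dict is a snapshot of the continuously updated result.
    """
    totals = defaultdict(int)
    for rider, fare in events:
        totals[rider] += fare
        yield dict(totals)


snapshots = list(streaming_totals(EVENTS))

# The final snapshot of the incrementally maintained result matches
# the query evaluated over the complete static table.
print(snapshots[-1] == batch_totals(EVENTS))
```

The intermediate snapshots correspond to the incremental results a streaming query emits before all input has been seen.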

Fabian Hueske and Shuyi Chen explore SQL’s role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges and how the unified stream and batch processing platform enables both technical and nontechnical users to process real-time and batch data reliably using the same SQL at Uber scale.

Fabian Hueske, Apache Flink PMC Member & Co-Founder @dataArtisans
Shuyi Chen, Senior Software Engineer II @Uber


  • Deep Learning Applications & Practices

    Deep learning lessons using tooling such as TensorFlow & PyTorch, across domains like large-scale cloud-native apps and fintech, and tackling concerns around interpretability of ML models.

  • Predictive Data Pipelines & Architectures

    Best practices for building real-world data pipelines doing interesting things like predictions, recommender systems, fraud prevention, ranking systems, and more.

  • ML in Action

    Applied track demonstrating how to train, score, and handle common machine learning use cases, including a heavy concentration in the space of security and fraud.

  • Real-world Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.