Track: Real-world Data Engineering

Location: Cyril Magnin III

Day of week: Wednesday

Showcasing data engineering technologies and highlighting the strengths of each in real-world applications.

Track Host: Wes Reisz

Software/Technical Advisor @C4Media & QCon Chair, previously Architect @HPE

Wes Reisz joined QCon in 2015 and leads QCon Editorial as the conference chair. Wes focuses his energies on providing a platform for practicing engineers to tell their war stories, so that innovator/early-adopter-stage engineers can learn, adopt, and, in many cases, challenge each other. Before joining the QCon team, Wes held a variety of enterprise architecture and software development roles with HP. His work there centered on identity development and federation, integration and development of Java-stack applications, architecting portal/CM solutions, and delivering mobility solutions for organizations such as the US Army's Human Resources Command (HRC), Army Recruiting Command, and the Army Cadet Support Program. In 2002, Wes began teaching as an adjunct faculty member at the University of Louisville. He continues to teach 400-level web architecture and mobile development courses to undergraduates, and is currently teaching Mobile Application Development with Android.

9:00am - 9:10am

Gimel: Commoditizing Data Access

Code that accesses data across a multitude of data stores tends to be fragile and complicated. To address this, we built a unified data processing platform at PayPal called Gimel. In this short talk, I will introduce Gimel's compute platform and data platform, and give an overview of how PayPal's data scientists, analysts, and developers take advantage of Gimel using GSQL and PayPal Notebooks.
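
For a flavor of GSQL, here is a minimal sketch modeled on the open-sourced Gimel project. The dataset names are hypothetical, and the GimelQueryProcessor entry point and its signature may differ by Gimel version:

    import org.apache.spark.sql.SparkSession
    import com.paypal.gimel.sql.GimelQueryProcessor

    object GsqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gsql-sketch").getOrCreate()

        // Hypothetical GSQL: join a Kafka topic to a Teradata table purely by
        // catalog name; Gimel resolves the physical connections underneath.
        val gsql =
          """SELECT t.customer_id, k.event_type
            |FROM pcatalog.teradata_customers t
            |JOIN pcatalog.kafka_events k
            |  ON t.customer_id = k.customer_id""".stripMargin

        // Entry point as shown in the open-source Gimel docs; may vary by version.
        val df = GimelQueryProcessor.executeBatch(gsql, spark)
        df.show()
      }
    }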

Romit Mehta, Product Manager, Data Platforms @PayPal

9:20am - 10:10am

Gimel: PayPal’s Analytics Data Platform

At PayPal, data engineers, analysts, and data scientists work with a variety of data sources (Messaging, NoSQL, RDBMS, Documents, TSDB), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL), and execution models (stream, batch, interactive).
Because of this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, and so on, which hurts time-to-market (TTM). To solve this problem and make product development more effective, PayPal's Data Platform team developed Gimel, a unified analytics data platform that provides access to any storage through a single unified Data API and SQL, both powered by a centralized data catalog.
In this session, we will introduce the components of Gimel: the compute platform, the Data API, PCatalog, GSQL, and Notebooks. We will also demo how Gimel reduces TTM by letting our engineers access any storage with a single line of code, without having to know the complexity behind the scenes.
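
To make the "single line of code" claim concrete, here is a minimal Data API sketch modeled on the open-sourced Gimel README. The dataset name is hypothetical, and exact factory methods and signatures may vary by version:

    import org.apache.spark.sql.SparkSession
    import com.paypal.gimel.DataSet

    object GimelDataApiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gimel-data-api-sketch").getOrCreate()

        // One Data API call reads any catalogued store (Kafka, HBase,
        // Elasticsearch, ...) by name; the dataset name here is made up.
        val dataSet = DataSet(spark)
        val df = dataSet.read("pcatalog.kafka_transactions")

        // From here on it is an ordinary DataFrame, whatever the backing store.
        df.printSchema()
      }
    }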

Deepak Chandramouli, Analytics Tech Lead @PayPal

10:35am - 10:45am

A Whirlwind Overview of Apache Beam

Apache Beam offers a novel programming model for data processing with two distinguishing features: full unification of batch and streaming, and portability across different runners and different languages.
We give a quick overview of the fundamentals of the Beam programming model, and an even quicker overview of the project's place in the data processing ecosystem and its future directions.
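
Beam itself ships Java and Python SDKs; purely as an illustration, and to keep the examples on this page in one language, here is a minimal WordCount-style pipeline written with Scio, Spotify's Scala API for Beam. Paths are placeholders, and method names follow recent Scio releases:

    import com.spotify.scio._

    object WordCountSketch {
      def main(cmdlineArgs: Array[String]): Unit = {
        // The runner is chosen at launch time, e.g. --runner=DirectRunner locally
        // or --runner=DataflowRunner on Cloud Dataflow; the pipeline code is the
        // same either way, which is Beam's portability story.
        val (sc, args) = ContextAndArgs(cmdlineArgs)

        sc.textFile(args.getOrElse("input", "input.txt"))
          .flatMap(_.split("""\s+""").filter(_.nonEmpty))
          .countByValue
          .map { case (word, count) => s"$word\t$count" }
          .saveAsTextFile(args.getOrElse("output", "wordcount-out"))

        sc.run()
      }
    }

Swapping the bounded text source for an unbounded one (e.g. Pub/Sub) leaves the transforms untouched, which is the batch/streaming unification the talk describes.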

Eugene Kirpichov, Staff Software Engineer on Cloud Dataflow @Google

10:55am - 11:45am

Simplifying ML Workflows With Apache Beam

Tyler Akidau, Founder/Committer on Apache Beam & Engineer @Google

12:45pm - 12:55pm

Optimizing Spark

I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general, and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "what effective network bandwidth are we getting?" instead of "how fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric you should optimize?
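
As a back-of-the-envelope illustration of the two points above (this is not code from the talk, and every number is hypothetical):

    object FigureOfMerit {
      def main(args: Array[String]): Unit = {
        // Effective network bandwidth: bytes shuffled over wall-clock time,
        // normalized per node. Say 2 TB shuffled in 1 hour on 20 nodes.
        val bytesShuffled = 2e12
        val elapsedSec    = 3600.0
        val nodes         = 20
        val perNode = bytesShuffled / elapsedSec / nodes
        println(f"Effective shuffle bandwidth: ${perNode / 1e6}%.1f MB/s per node")

        // Cost-vs-time trade-off from the abstract: 55 min on a doubled cluster
        // versus 60 min on the baseline, at a hypothetical $10/hour baseline rate.
        val hourlyRate   = 10.0
        val costBaseline = (60.0 / 60) * hourlyRate
        val costDoubled  = (55.0 / 60) * hourlyRate * 2
        println(f"Baseline: $$${costBaseline}%.2f vs doubled cluster: $$${costDoubled}%.2f, to save 5 minutes")
      }
    }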

Greg Novak, Data Scientist @StitchFix

2:20pm - 2:30pm

(Past), Present, and Future of Apache Flink

Aljoscha Krettek, Co-Founder @dataArtisans

2:40pm - 3:30pm

Streaming SQL to Unify Batch & Stream Processing W/ Apache Flink @Uber

Fabian Hueske, Apache Flink PMC Member & Co-Founder @dataArtisans
Shuyi Chen, Senior Software Engineer II @Uber

2019 Tracks

  • Grokking Timeseries & Sequential Data

    Techniques, practices, and approaches around time series and sequential data. Expect topics including image recognition, NLP/NLU, preprocessing, and the crunching of related algorithms.

  • Deep Learning in Practice

    Deep learning use cases around edge computing, deep learning for search, explainability, fairness, and perception.