Track: Real-world Data Engineering

Location: Cyril Magnin III

Day of week: Wednesday

Showcasing data engineering technologies and highlighting the strengths of each in real-world applications.

Track Host: Wes Reisz

Software/Technical Advisor @C4Media & QCon Chair; previously Architect @HPE

Wesley Reisz is the VP of Technology for Section (an Edge PaaS focused on rethinking how the edge is used in DevOps-focused application development). Wes also chairs the LFEdge Landscape Working Group and the San Francisco edition of the software conference QCon.

Before joining Section, Wes served as the product owner for all of the English-speaking QCon conferences worldwide, was a principal architect with HP Enterprise Systems, and, for over 13 years, taught as an adjunct professor at the University of Louisville (Go Cards!).

At HPE, Wes’ primary roles supported the US Army’s Human Resources Command (HRC), Recruiting, and Cadet Support Commands based at Fort Knox, Kentucky. Wes was the principal architect for US Army Cadet Command and was known for championing, building, and deploying enterprise portal and identity solutions used by Army Recruiting.

In addition to his current roles, Wes hosts a weekly podcast, The InfoQ Podcast, which serves senior early adopter/early majority developers and architects with interviews from some of software’s most important thought leaders. The podcast has been downloaded over 1.5 million times and has a weekly listener base of around 14k.

9:00am - 9:10am

Gimel: Commoditizing Data Access

Accessing data across a multitude of data stores is extremely fragile and complicated. To address this issue, we built a unified data processing platform at PayPal called Gimel. In this short talk I will introduce you to Gimel's compute platform and data platform and give an overview of how PayPal's data scientists, analysts and developers are taking advantage of Gimel using GSQL and PayPal Notebooks. 

Romit Mehta, Product Manager, Data Platforms @PayPal

9:20am - 10:10am

Gimel: PayPal’s Analytics Data Platform

At PayPal, data engineers, analysts, and data scientists work with a variety of data sources (messaging, NoSQL, RDBMS, documents, TSDB), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL), and execution models (stream, batch, interactive).

Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc., which impacts time-to-market (TTM). To solve this problem and make product development more effective, PayPal Data Platform developed "Gimel", a unified analytics data platform that provides access to any storage through a single unified Data API and SQL, both powered by a centralized data catalog.

In this session, we will introduce you to the various components of Gimel: the Compute Platform, Data API, PCatalog, GSQL, and Notebooks. We will also give a demo showing how Gimel reduces TTM by letting our engineers write a single line of code to access any storage without knowing the complexity behind the scenes.
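
To make the "one line to any store" idea concrete, here is a hypothetical Python sketch (the module and dataset names are illustrative, not PayPal's actual API): the catalog entry, rather than the application code, carries the connector details.

    # Hypothetical sketch: the PCatalog entry, not the code, knows whether
    # the dataset lives in Kafka, HBase, or an RDBMS, plus formats and creds.
    from gimel import data_api  # illustrative module name, not the real binding

    df = data_api.read("pcatalog.transactions_kafka")       # resolved via PCatalog
    high_value = df.filter(df.amount > 100)                 # ordinary DataFrame code
    data_api.write("pcatalog.transactions_hbase", high_value)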

Deepak Chandramouli, Analytics Tech Lead @PayPal

10:35am - 10:45am

A Whirlwind Overview of Apache Beam

Apache Beam offers a novel programming model for data processing with two major distinctive features: full unification of batch and streaming, and portability across different runners and different languages.

We give a quick overview of the fundamentals of the Beam programming model, and an even quicker overview of the project's place in the data processing ecosystem and its future directions.
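
As a taste of that model, here is a minimal sketch using the Beam Python SDK: the same pipeline shape serves bounded and unbounded data, with the runner chosen by configuration rather than by code.

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:  # DirectRunner by default; any runner via options
        (p
         | "Read" >> beam.Create([("user1", 1), ("user2", 3), ("user1", 2)])
         | "Window" >> beam.WindowInto(window.FixedWindows(60))  # same code for streams
         | "SumPerKey" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))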

Eugene Kirpichov, Cloud Dataflow Staff SE @Google

10:55am - 11:45am

Simplifying ML Workflows With Apache Beam

Come learn how Apache Beam is simplifying pre- and post-processing for ML pipelines. Apache Beam provides a portability layer that allows Beam pipelines to be written once and executed on any supported runtime. 2018 will be the year in which the Beam community completes the portability vision laid out when the project was founded, with full cross-language portability and robust open source runner support for Apache Flink and Spark.

Come see where we are in that journey, and learn how Beam is being integrated into the world of AI.
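
For a flavor of what Beam-based pre-processing looks like, here is a small sketch (illustrative, not production code from the talk): compute a full-pass statistic, then apply it to every element via a side input, a common feature-scaling pattern.

    import apache_beam as beam
    from apache_beam.pvalue import AsSingleton

    with beam.Pipeline() as p:
        raw = p | beam.Create([4.0, 8.0, 12.0])        # stand-in feature values
        mean = raw | beam.combiners.Mean.Globally()    # full-pass statistic
        (raw
         | "Center" >> beam.Map(lambda x, m: x - m, m=AsSingleton(mean))
         | beam.Map(print))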

Tyler Akidau, Founder/Committer on Apache Beam & Engineer @Google

12:45pm - 12:55pm

Optimizing Spark

I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "What effective network bandwidth are we getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?
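
A back-of-the-envelope model makes the point (illustrative numbers, not from the talk): cost grows with nodes × time, while speedup is rarely linear, so "faster" and "cheaper" are different optimization targets.

    def cost_dollars(nodes, minutes, rate_per_node_hour=1.0):
        # Simple on-demand pricing model: pay per node per hour.
        return nodes * (minutes / 60.0) * rate_per_node_hour

    baseline = cost_dollars(nodes=10, minutes=60)   # $10.00
    faster   = cost_dollars(nodes=20, minutes=55)   # $18.33: ~8% faster, ~83% pricier
    print(baseline, faster)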

Greg Novak, Data Scientist @StitchFix

1:05pm - 1:55pm

When We Spark and When We Don’t: Developing Data and ML Pipelines

The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities powering nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.

This talk will cover thoughts and guidelines on how we develop, schedule, and maintain these pipelines at Stitch Fix. We'll discuss how we decide which portions of a pipeline should run on which platform (e.g., what should run distributed across Spark clusters vs. in stand-alone containers) and how we get them to play well together. We'll also give an overview of the tools and abstractions developed at Stitch Fix to support these pipelines from development, to deployment, to monitoring in production.
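
As a hypothetical illustration of that kind of guideline (not Stitch Fix's actual logic), the decision often reduces to whether a task's working set and shuffle needs fit comfortably on a single node:

    # Hypothetical rule of thumb for routing a pipeline task.
    def choose_platform(input_gb, single_node_mem_gb=240, needs_shuffle=False):
        # Distribute only when the data won't fit on one machine or the job
        # requires a cluster-wide shuffle; otherwise keep it simple.
        if needs_shuffle or input_gb > 0.5 * single_node_mem_gb:
            return "spark-cluster"
        return "single-node-container"  # e.g., a Python or R task in a container

    print(choose_platform(input_gb=30))     # single-node-container
    print(choose_platform(input_gb=500))    # spark-cluster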

Jeff Magnusson, VP Data Platform @StitchFix

2:20pm - 2:30pm

(Past), Present, and Future of Apache Flink

Apache Flink offers a fast, distributed, and fault-tolerant data-processing engine along with APIs for many different use cases, chief among them stateful stream processing. We give a quick overview of Flink's capabilities before discussing its current state, the upcoming new release, and future developments.
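
For a sense of what stateful stream processing looks like, here is a minimal keyed computation written against today's PyFlink API (the Python API postdates this talk; data and names are illustrative):

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    events = env.from_collection([("user1", 1), ("user2", 3), ("user1", 2)])
    (events
        .key_by(lambda e: e[0])                      # state is partitioned per key
        .reduce(lambda a, b: (a[0], a[1] + b[1]))    # fault-tolerant running sum
        .print())
    env.execute("keyed_running_sum")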

Aljoscha Krettek, Co-Founder @dataArtisans

2:40pm - 3:30pm

Streaming SQL to Unify Batch & Stream Processing W/ Apache Flink @Uber

SQL is the lingua franca for querying and processing data. To this day, it provides nonprogrammers with a powerful tool for analyzing and manipulating data. But with the emergence of stream processing as a core technology for data infrastructures, can you still use SQL and bring real-time data analysis to a broader audience?

The answer is yes, you can. SQL fits into the streaming world very well and forms an intuitive and powerful abstraction for streaming analytics. More importantly, you can use SQL as an abstraction to unify batch and streaming data processing. Viewing streams as dynamic tables, you can obtain consistent results from SQL evaluated over static tables and streams alike and use SQL to build materialized views as a data integration tool.

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink, covering fundamental concepts such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges and how the unified stream and batch processing platform enables both technical and nontechnical users to process real-time and batch data reliably with the same SQL at Uber scale.
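
A small sketch of the dynamic-tables idea, written against today's PyFlink Table API (illustrative schema and data, not Uber's code): the same SQL yields the same result over a static table or a stream; over a stream, Flink maintains the aggregate incrementally.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # A stream viewed as a table that is continuously appended to.
    rides = t_env.from_elements(
        [("d1", 12.5), ("d2", 7.0), ("d1", 3.5)],
        ["driver_id", "fare"],
    )
    t_env.create_temporary_view("rides", rides)

    # In streaming mode this emits a changelog of updates: in effect,
    # a continuously materialized view of the aggregation.
    t_env.execute_sql(
        "SELECT driver_id, SUM(fare) AS total_fare FROM rides GROUP BY driver_id"
    ).print()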

Fabian Hueske, Apache Flink PMC Member & Co-Founder @dataArtisans
Shuyi Chen, Senior Software Engineer II @Uber

2019 Tracks

  • ML in Action

    Applied track demonstrating how to train, score, and handle common machine learning use cases, with a heavy concentration on security and fraud.

  • Deep Learning in Practice

    Deep learning use cases around edge computing, deep learning for search, explainability, fairness, and perception.

  • Handling Sequential Data Like an Expert / ML Applied to Operations

    Discussing the complexities of time (half track) and machine learning in the data center (half track), with topics ranging from HyperLogLog to predictive auto-scaling across the two half-day tracks.