You are viewing content from a past/completed QCon

Presentation: Optimizing Spark

Track: Real-world Data Engineering

Location: Cyril Magnin III

Duration: 12:45pm - 12:55pm

Day of week: Wednesday


I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "What's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?
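The 55-versus-60-minute question can be made concrete with a little arithmetic. The following is a minimal sketch, not material from the talk: the `JobRun` class, the dollar amounts, and the 1 TB data volume are all hypothetical values chosen for illustration. It computes an effective aggregate bandwidth for each run and the marginal price paid per minute saved.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One measured run of a job. All values used below are hypothetical."""
    wall_minutes: float   # elapsed wall-clock time
    cost_dollars: float   # total bill for the run
    bytes_moved: float    # bytes read/shuffled, assumed to be measured elsewhere

    @property
    def effective_bandwidth_mb_s(self) -> float:
        # Aggregate bytes moved per second of wall time, in MB/s.
        return self.bytes_moved / (self.wall_minutes * 60.0) / 1e6


def dollars_per_minute_saved(slow: JobRun, fast: JobRun) -> float:
    """Extra dollars paid for each minute of wall time saved."""
    return (fast.cost_dollars - slow.cost_dollars) / (slow.wall_minutes - fast.wall_minutes)


# Hypothetical numbers mirroring the example: 5 minutes saved, bill doubled.
baseline = JobRun(wall_minutes=60, cost_dollars=10.0, bytes_moved=1e12)
doubled = JobRun(wall_minutes=55, cost_dollars=20.0, bytes_moved=1e12)

print(f"baseline: {baseline.effective_bandwidth_mb_s:.1f} MB/s")
print(f"doubled:  {doubled.effective_bandwidth_mb_s:.1f} MB/s")
print(f"price per minute saved: ${dollars_per_minute_saved(baseline, doubled):.2f}")
```

Under these assumed numbers, doubling the bill raises effective bandwidth by only about 9%, which is one way to see why neither "lowest cost" nor "shortest runtime" alone is an obvious figure of merit.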

Speaker: Greg Novak

Data Scientist @StitchFix

Greg Novak leads the Global Optimization team at Stitch Fix, focusing on optimizations that cut across the organizational and structural boundaries of the business. To test and evaluate these cross-functional improvements, he has worked across the entire stack to develop software infrastructure enabling the use of experimental techniques previously unused at Stitch Fix. Before joining Stitch Fix, Greg spent a decade doing research in astrophysics focused on black holes and galaxy evolution.

2019 Tracks

  • Groking Timeseries & Sequential Data

    Techniques, practices, and approaches around time series and sequential data. Expect topics including image recognition, NLP/NLU, preprocessing, and crunching of related algorithms.

  • Deep Learning in Practice

    Deep learning use cases around edge computing, deep learning for search, explainability, fairness, and perception.