Presentation: Optimizing Spark

Track: Real-world Data Engineering

Location: Cyril Magnin III

Duration: 12:45pm - 12:55pm

Day of week: Wednesday

Abstract

I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular.  It's easier to be systematic about this if you think in terms of "What's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?"  In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish).  You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it?  No?  Then what exactly is the metric that you should optimize?
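
To make the 55-versus-60-minute question concrete, here is a minimal Python sketch. It is not taken from the talk: the `Run` class, the dollar amounts, and the byte counts are all hypothetical, chosen only so that the faster run costs exactly twice as much. It computes the effective bandwidth of each run and the price paid per minute of wall time saved.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured run of a job. Every number below is hypothetical."""
    wall_minutes: float   # elapsed wall-clock time
    cost_dollars: float   # total bill for the run
    bytes_moved: float    # total bytes sent over the network

    def effective_bandwidth(self) -> float:
        # Aggregate bytes per second actually achieved during the run.
        return self.bytes_moved / (self.wall_minutes * 60.0)


def dollars_per_minute_saved(base: Run, fast: Run) -> float:
    # Extra dollars paid for each minute of wall time the faster run saves.
    return (fast.cost_dollars - base.cost_dollars) / (
        base.wall_minutes - fast.wall_minutes
    )


# Assumed scenario: same data volume, 5 minutes faster, twice the bill.
base = Run(wall_minutes=60, cost_dollars=100.0, bytes_moved=1.8e12)
fast = Run(wall_minutes=55, cost_dollars=200.0, bytes_moved=1.8e12)

print(base.effective_bandwidth())            # bytes per second, baseline
print(fast.effective_bandwidth())            # bytes per second, faster run
print(dollars_per_minute_saved(base, fast))  # marginal price of a minute
```

Framing the comparison as a marginal price per minute saved is one possible figure of merit, not necessarily the one the talk proposes. It turns "would you do it?" into a question about what a minute of waiting is worth to you.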

Note: This is a short talk. Short talks are 10-minute talks designed to offer breadth across the areas of machine learning, artificial intelligence, and data engineering. The short talks are focused on the tools and practices of data science with an eye towards the software engineer.

Speaker: Greg Novak

Data Scientist @StitchFix

Greg Novak leads the Global Optimization team at Stitch Fix, focusing on optimizations that cut across the organizational and structural boundaries of the business.  To test and evaluate these cross-functional improvements, he has worked across the entire stack to develop software infrastructure enabling the use of experimental techniques previously unused at Stitch Fix.  Before joining Stitch Fix, Greg spent a decade doing research in astrophysics focused on black holes and galaxy evolution.

Proposed Tracks

  • Real-World Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.

  • Deep Learning Applications & Practices

    Deep learning lessons using TensorFlow, Keras, PyTorch, and Caffe across machine translation and computer vision.

  • AI Meets the Physical World

    The track where AI touches the physical world: think drones, ROS, NVIDIA, TPU, and more.

  • Data Architectures You've Always Wondered About

    How did they do that? Real-time predictive pipelines at places like Uber, self-driving cars at Google, and robotic warehouses from Ocado in the UK are all possible examples.

  • Applied ML for Software

    Practical machine learning inside the data centers and on software engineering teams.

  • Time Series Patterns & Practices

    Stocks, ad tech/real-time bidding, and anomaly detection. Patterns and practices for more effective Time Series work.