Presentation: Optimizing Spark

Track: Real-world Data Engineering

Location: Cyril Magnin III

Duration: 12:45pm - 12:55pm

Day of week: Wednesday


I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "What's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?
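As a rough illustration of the trade-off described above, the sketch below puts numbers on the two questions. It is not taken from the talk: the function names, the dollar amounts, and the data volume are all hypothetical, and "effective bandwidth" is assumed here to mean simply bytes processed divided by wall-clock time.

```python
def effective_bandwidth_bytes_per_s(bytes_processed: float, wall_seconds: float) -> float:
    """Assumed definition: total bytes the job handled divided by wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return bytes_processed / wall_seconds


def dollars_per_minute_saved(base_cost: float, base_minutes: float,
                             new_cost: float, new_minutes: float) -> float:
    """Extra dollars paid for each minute of runtime removed."""
    minutes_saved = base_minutes - new_minutes
    if minutes_saved <= 0:
        raise ValueError("the new configuration must be faster than the baseline")
    return (new_cost - base_cost) / minutes_saved


# Hypothetical numbers: a 60-minute job costing $10, versus a 55-minute
# run of the same job at double the bill ($20).
marginal_price = dollars_per_minute_saved(10.0, 60.0, 20.0, 55.0)
print(f"Marginal price of speed: ${marginal_price:.2f} per minute saved")

# Hypothetical volume: 3.6 TB processed in one hour of wall-clock time.
bandwidth = effective_bandwidth_bytes_per_s(3.6e12, 3600.0)
print(f"Effective bandwidth: {bandwidth / 1e9:.2f} GB/s")
```

Framing the faster run as a price per minute saved makes the implicit question explicit: whether five minutes is worth that price depends on what the answer is needed for, which is why no single metric is obviously correct.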

Note: This is a short talk. Short talks are 10-minute talks designed to offer breadth across the areas of machine learning, artificial intelligence, and data engineering. The short talks are focused on the tools and practices of data science with an eye towards the software engineer.

Speaker: Greg Novak

Data Scientist @StitchFix

Greg Novak leads the Global Optimization team at Stitch Fix, focusing on optimizations that cut across the organizational and structural boundaries of the business. To test and evaluate these cross-functional improvements, he has worked across the entire stack to develop software infrastructure enabling the use of experimental techniques previously unused at Stitch Fix. Before joining Stitch Fix, Greg spent a decade doing research in astrophysics focused on black holes and galaxy evolution.


  • Deep Learning Applications & Practices

    Deep learning lessons using tooling such as TensorFlow & PyTorch, across domains like large-scale cloud-native apps and fintech, and tackling concerns around interpretability of ML models.

  • Predictive Data Pipelines & Architectures

    Best practices for building real-world data pipelines doing interesting things like predictions, recommender systems, fraud prevention, ranking systems, and more.

  • ML in Action

    Applied track demonstrating how to train, score, and handle common machine learning use cases, including a heavy concentration in the space of security and fraud.

  • Real-world Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.