Presentation: When We Spark and When We Don’t: Developing Data and ML Pipelines

Track: Real-world Data Engineering

Location: Cyril Magnin III

Duration: 1:05pm - 1:55pm

Day of week: Wednesday

Abstract

The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities to power nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.

This talk will cover how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss guidelines for deciding which portions of a pipeline run on which platform (e.g., what is important to distribute across Spark clusters versus run in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of the tools and abstractions developed at Stitch Fix to support the process from development, to deployment, to monitoring in production.

Speaker: Jeff Magnusson

VP Data Platform @StitchFix

As Director of the Data Platform at Stitch Fix, Jeff Magnusson leads the team responsible for building robust and scalable infrastructure and data services that integrate with numerous interfaces across the business. By leveraging machine computation together with expert human judgement to generate recommendations and insights, these platforms unlock innovative ways to utilize data science and machine learning that optimize and differentiate the way the company operates the business. Prior to Stitch Fix, Jeff managed the Data Platform Architecture team at Netflix, where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform. Jeff holds a PhD from the University of Florida, specializing in database system implementation.

Proposed Tracks

  • Real-World Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.

  • Deep Learning Applications & Practices

    Deep learning lessons using TensorFlow, Keras, PyTorch, and Caffe across machine translation and computer vision.

  • AI Meets the Physical World

    The track where AI touches the physical world: think drones, ROS, NVIDIA, TPU, and more.

  • Data Architectures You've Always Wondered About

    How did they do that? Real-time predictive pipelines at places like Uber, self-driving cars at Google, and robotic warehouses from Ocado in the UK are all possible examples.

  • Applied ML for Software

    Practical machine learning inside the data centers and on software engineering teams.

  • Time Series Patterns & Practices

    Stocks, ad tech/real-time bidding, and anomaly detection. Patterns and practices for more effective Time Series work.