Presentation: When We Spark and When We Don’t: Developing Data and ML Pipelines

Track: Real-world Data Engineering

Location: Cyril Magnin III

Duration: 1:05pm - 1:55pm

Day of week: Wednesday

Abstract

The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities to power nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.

This talk will cover how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss guidelines for deciding which portions of a pipeline run on which platform (e.g., what is important to run distributed across Spark clusters versus in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of the tools and abstractions developed at Stitch Fix to support the process from development, to deployment, to monitoring in production.

Speaker: Jeff Magnusson

VP Data Platform @StitchFix

As Director of the Data Platform at Stitch Fix, Jeff Magnusson leads the team responsible for building robust and scalable infrastructure and data services that integrate with numerous interfaces across the business. By leveraging machine computation together with expert human judgement to generate recommendations and insights, these platforms unlock innovative ways to utilize data science and machine learning that optimize and differentiate the way the company operates the business. Prior to Stitch Fix, Jeff managed the Data Platform Architecture team at Netflix, where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform. Jeff holds a PhD from the University of Florida, specializing in database system implementation.

Tracks

  • Deep Learning Applications & Practices

    Deep learning lessons using tooling such as TensorFlow & PyTorch, across domains like large-scale cloud-native apps and fintech, and tackling concerns around interpretability of ML models.

  • Predictive Data Pipelines & Architectures

    Best practices for building real-world data pipelines doing interesting things like predictions, recommender systems, fraud prevention, ranking systems, and more.

  • ML in Action

    Applied track demonstrating how to train, score, and handle common machine learning use cases, with a heavy concentration in security and fraud.

  • Real-world Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.