Presentation: Very Large Datasets With the GPU Data Frame

Track: Hands-on Codelabs & Speakers Office Hours

Location: Mission

Duration: 10:35am - 10:45am

Day of week: Wednesday


Abstract

Use of the humble GPU has spiked over the past couple of years as machine learning and data analytics workloads have been optimized to take advantage of the GPU’s parallelism and memory bandwidth. Even though these operations (the steps of the Machine Learning Pipeline) could all run on the same GPUs, they were typically isolated, and much slower than they needed to be, because data was serialized and deserialized between the steps over PCIe.

That inefficiency was recently addressed by the formation of the GPU Open Analytics Initiative (GOAI http://gpuopenanalytics.com/), an industry consortium founded by MapD, H2O.ai and Anaconda. This group created the GPU data frame (GDF), based on Apache Arrow, for passing data between processes while keeping it all in GPU memory. In this talk we will explain how the GDF technology works, show how it is enabling a diverse set of GPU workloads, and demonstrate how to use a Jupyter Notebook to take advantage of it. Using a very large dataset, we’ll demonstrate how to manage a full Machine Learning Pipeline with minimal data exchange overhead between MapD’s SQL engine and H2O’s generalized linear model library (GLM).
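To make the workflow concrete, the pipeline might look roughly like the sketch below: query MapD with pymapd, receive the result set as a GPU data frame, and hand the features to an h2o4gpu GLM. The connection parameters, the hypothetical `trips` table and its columns, and the host-side hand-off to `ElasticNet` are illustrative assumptions, not the presenter’s exact demo code; a newer GDF/cudf stack can keep more of the hand-off on the GPU.

```python
# Sketch only: query MapD into a GPU data frame, then fit a GLM with h2o4gpu.
# Table/column names and connection details are assumptions for illustration.
import pymapd
import h2o4gpu

# Connect to a MapD server (default credentials shown for illustration).
con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", dbname="mapd")

# select_ipc_gpu() returns the result set as a GPU data frame
# (Apache Arrow layout in GPU memory), avoiding a copy back to the host.
gdf = con.select_ipc_gpu(
    "SELECT trip_distance, total_amount FROM trips WHERE total_amount > 0")

# Light feature preparation can stay on the GPU via data frame operations.
gdf = gdf.query("trip_distance < 100")

# Hand features to H2O's GLM. Here the data is copied to host NumPy arrays;
# depending on the library versions, the GDF may be consumable more directly.
X = gdf["trip_distance"].to_pandas().values.reshape(-1, 1)
y = gdf["total_amount"].to_pandas().values

model = h2o4gpu.ElasticNet()   # scikit-learn-style GLM, GPU-accelerated
model.fit(X, y)
print(model.coef_)
```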

Note: This is a short talk. Short talks are 10-minute talks designed to offer breadth across the areas of machine learning, artificial intelligence, and data engineering. The short talks are focused on the tools and practices of data science with an eye towards the software engineer.

Speaker: Veda Shankar

Senior Developer Advocate @MapD

Veda Shankar is a Developer Advocate at MapD, working actively to help the user community take advantage of MapD’s open source analytics platform. He is a customer-oriented IT specialist with a unique combination of experience in product development, marketing, and sales engineering. Prior to MapD, Veda worked on various open source software-defined data center products at Red Hat.


Proposed Tracks

  • Real-World Data Engineering

    Showcasing DataEng tech and highlighting the strengths of each in real-world applications.

  • Deep Learning Applications & Practices

    Deep learning lessons using TensorFlow, Keras, PyTorch, and Caffe across machine translation and computer vision.

  • AI Meets the Physical World

    The track where AI touches the physical world: think drones, ROS, NVIDIA, TPUs, and more.

  • Data Architectures You've Always Wondered About

    How did they do that? Real-time predictive pipelines at places like Uber, self-driving cars at Google, and robotic warehouses from Ocado in the UK are all possible examples.

  • Applied ML for Software

    Practical machine learning inside the data centers and on software engineering teams.

  • Time Series Patterns & Practices

    Stocks, ad tech/real-time bidding, and anomaly detection: patterns and practices for more effective time series work.