
Presentation: Massive Scale Anomaly Detection Framework

Track: Predictive Architectures in the Real World

Location: Cyril Magnin I

Duration: 11:40am - 12:20pm

Day of week: Tuesday


This presentation is now available to view on InfoQ.com


Abstract

Early detection of abnormal events can be critical for many business applications, but implementing real-time anomaly models at scale poses numerous challenges. Server failures, developer errors and malicious activity are very different scenarios with different engineering requirements. Moreover, most analytical models have traditionally been designed for the batch processing paradigm and usually cannot be easily adapted to unbounded datasets and real-time latencies.


At PayPal, we must be able to analyze billions of events every day in real time across a wide range of services, devices and locations. In a collaboration between our platform engineering and data science teams, we have built a generic framework for developing robust and scalable anomaly detection streaming applications, with a focus on the flexibility to support different types of statistical and machine learning models. Inspired by the design of scikit-learn and Spark MLlib, we have designed a simple pipeline-based API on top of Spark Structured Streaming that captures common patterns of the anomaly detection domain.
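
To make the pipeline idea concrete, here is a minimal sketch in plain Python of what a scikit-learn-style chain of streaming stages might look like. It is not the PayPal framework's API and does not use Spark: the class names (`RunningZScore`, `ThresholdFlag`, `Pipeline`), the `transform` method, the record fields and the threshold of 3.0 are all hypothetical, chosen only for illustration. Each record is scored against the running statistics of the values seen before it, so the sketch works on an unbounded stream without a batch pass.

```python
import math
from typing import Iterable, Iterator, List


class RunningZScore:
    """Hypothetical stage: scores each value against the history seen so far.

    Mean and variance are maintained online with Welford's algorithm,
    so no past records need to be stored.
    """

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations

    def transform(self, record: dict) -> dict:
        x = record["value"]
        # Sample standard deviation of the history (needs at least 2 values).
        std = math.sqrt(self.m2 / (self.n - 1)) if self.n >= 2 else 0.0
        score = abs(x - self.mean) / std if std > 0 else 0.0
        # Update the running statistics with the current value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return {**record, "score": score}


class ThresholdFlag:
    """Hypothetical stage: flags a record when its score exceeds a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, record: dict) -> dict:
        return {**record, "anomaly": record["score"] > self.threshold}


class Pipeline:
    """Hypothetical pipeline: applies each stage in order to every record."""

    def __init__(self, stages: List) -> None:
        self.stages = list(stages)

    def run(self, stream: Iterable[dict]) -> Iterator[dict]:
        for record in stream:
            for stage in self.stages:
                record = stage.transform(record)
            yield record


# Usage: a small in-memory list stands in for an unbounded event stream.
events = [{"value": v} for v in [10.0, 11.0, 9.0, 10.0, 11.0, 10.0, 50.0, 10.0]]
pipeline = Pipeline([RunningZScore(), ThresholdFlag(threshold=3.0)])
results = list(pipeline.run(events))
flagged = [r["value"] for r in results if r["anomaly"]]
print(flagged)
```

Because each stage exposes the same `transform` method, a different scoring model could be swapped in without touching the rest of the chain, which is the kind of flexibility the abstract describes. In a real deployment the stages would run inside a streaming engine rather than a Python loop.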


At the base of the framework, we took advantage of Spark Structured Streaming's fast and scalable execution engine, together with stream-oriented building blocks, to allow easy extension to new production-grade models. We found that real-time anomaly detection provides powerful capabilities in many different fields. Internally, we use the framework for a variety of use cases, ranging from fraud prevention to operations and security.

Speaker: Guy Gerson

Big Data Developer @PayPal

Guy Gerson is a Software Engineer on PayPal’s next generation stream processing platform core team. He is currently working on adapting statistical and machine learning methodologies for use in real-time data pipelines. Prior to PayPal, he was a Researcher in the IBM Cloud and Data Technologies group, focusing on designing large-scale Internet of Things analytics architectures.

