Operationalizing Analytics To Scale
Scott Hoover
Abstract

Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data in (near) real time. However, for the analyst or data scientist who builds models offline, integrating their analyses into these pipelines for operational purposes can pose a challenge.

In this workshop, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating analytical, statistical and machine learning solutions: from collection and storage to analysis and real-time predictions.
Agenda

● Introduction
● What Are We Talking About, Exactly?
● The Problem at Hand
● Operationalizing Analytics
● Operationalizing Predictive Analytics
● Questions
Introduction

● I work on the Internal Data team at Looker.
● Before Looker, I worked in consulting and research.
● Looker is a business intelligence tool.
What Are We Talking About, Exactly?

● What do I mean when I say “operationalizing”?
● Why is this important?
The Problem at Hand

● Analysts are providing basic reports for the entire business.
● Analysts and Data Scientists are building offline models.
The Problem With Offline Models

● Offline analyses have slow turnaround times.
● Offline analyses aren’t particularly collaborative.
● Offline analyses aren’t particularly portable.
A Potential Set-up (Straw Man)

[Diagram: Data Sources → (HTTP) → Data Stores → (query) → Analysis / Consumption]
Operationalizing Analytics - The Simple Case

● These metrics are vanilla.
● These metrics are critical.
● The business would probably be better served if Data Scientists and Analysts were spending their time answering questions that require deep technical knowledge.
Operationalizing Analytics - A How To

● Build or buy a workhorse ETL tool.
● Move toward an Operational Data Store (ODS), reducing the need for postprocessing and data “mashups.”
● Emphasize self-service wherever possible.
● Analytics should slot into the existing infrastructure with minimal friction.
Operationalizing Predictive Analytics
Where to Begin

● Out-of-the-box tools.
● Build from scratch.
● A mean between extremes.
A Model Standard - PMML

● An XML-based model-storage format.
● Created and maintained by the Data Mining Group.
● Most commonly used statistical/machine learning models are supported.
PMML Integrations - Producers and Consumers
JPMML

● JPMML is an open-source API for evaluating PMML files.
● In essence, we equip the JPMML application with our PMML file, serve it up with new data, and it provides us with predictions.
● Openscoring.io distributes various JPMML APIs and UDFs: for example, RESTful API, Heroku, Hive, Pig, Cascading and PostgreSQL.
● All we have to do is write some code that fetches new values, serves them up to the JPMML API, captures the predictions, then pushes them back to a database.
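That fetch-score-write-back loop can be sketched in a few lines of Python. This is a minimal sketch, not the talk's actual code: the endpoint URL, field names, and the `build_score_request`/`score_lead` helpers are assumptions modeled on the Openscoring request format shown later in the deck.

```python
import json
import urllib.request

# Hypothetical Openscoring endpoint; substitute your own deployment.
OPENSCORING_URL = "http://ec2_endpoint/openscoring/model/BayesLeadScore"

def build_score_request(lead_id, country, budget):
    """Build the JSON body the Openscoring API expects for a single lead."""
    return {"id": lead_id, "arguments": {"country": country, "budget": budget}}

def score_lead(lead_id, country, budget):
    """POST one lead to the scoring server and return the parsed response."""
    body = json.dumps(build_score_request(lead_id, country, budget)).encode("utf-8")
    req = urllib.request.Request(
        OPENSCORING_URL,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# In a real pipeline, the probabilities returned by score_lead() would then
# be written back to the leads table in the database.
```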
Example Architecture - Lead Scoring

[Diagram: GET leads from the application API → POST to the scoring API → UPDATE each lead with its prediction]
Deploy Model - PUT /model/${id}

Heroku: git push heroku master

REST:
curl -X PUT \
  --data-binary @BayesLeadScore.pmml \
  -H "Content-type: text/xml" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore
View Model - GET /model/${id}

CURLing or navigating to http://heroku_endpoint/openscoring/model/BayesLeadScore or http://ec2_endpoint/openscoring/model/BayesLeadScore will display our PMML model.
Test Model - POST /model/${id}

newLead.json:
{
  "id": "001",
  "arguments": {
    "country": "US",
    "budget": 7.8
  }
}

Send request to JPMML API:
curl -X POST \
  --data-binary @newLead.json \
  -H "Content-type: application/json" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore
Example Response

{
  "id": "001",
  "result": {
    "meeting": "1",
    "Probability_0": 0.33062906130485653,
    "Probability_1": 0.6693709386951435
  }
}
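Capturing the prediction from a response like the one above is plain JSON parsing. A sketch in Python, using the field names from the example response (`meeting`, `Probability_1`):

```python
import json

# The example response from the scoring server, verbatim.
response_text = """
{"id": "001",
 "result": {"meeting": "1",
            "Probability_0": 0.33062906130485653,
            "Probability_1": 0.6693709386951435}}
"""

resp = json.loads(response_text)
predicted_class = resp["result"]["meeting"]      # the predicted label
p_meeting = resp["result"]["Probability_1"]      # probability the lead books a meeting
```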
Batch Request - POST /model/${id}/batch

batchLeads.json:
{
  "id": "batch-1",
  "requests": [
    { "id": "001", "arguments": { "country": "US", "budget": 7.8 } },
    { "id": "002", "arguments": { "country": "CA", "budget": 3.2 } }
  ]
}

Send request to JPMML API:
curl -X POST \
  --data-binary @batchLeads.json \
  -H "Content-type: application/json" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore/batch
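In practice the batch body would be assembled from rows fetched out of the leads table rather than written by hand. A sketch of that assembly step (the `build_batch_request` helper and the tuple layout are assumptions for illustration):

```python
def build_batch_request(batch_id, leads):
    """Assemble the Openscoring batch body: one sub-request per lead row.

    `leads` is a list of (lead_id, country, budget) tuples, e.g. rows
    fetched from the leads table.
    """
    return {
        "id": batch_id,
        "requests": [
            {"id": lead_id, "arguments": {"country": country, "budget": budget}}
            for lead_id, country, budget in leads
        ],
    }

# Reproduces the batchLeads.json payload shown above.
payload = build_batch_request("batch-1", [("001", "US", 7.8), ("002", "CA", 3.2)])
```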
Scale Considerations

● Horizontal scaling.
● Vertical scaling.
What About Truly Big Data?

● For the rare few of us who need to make real-time predictions against millions of rows per second, there’s a popular Apache suite to handle this.

*image borrowed from OryxProject
[Architecture diagram components: Applications, Transactional DB / Event Storage, ODS, Analysis, APIs, Scoring Server, Business Intelligence, Consumers, Review / Versioning]
Closing Thoughts
Questions
Learn more at looker.com/demo