Operationalizing Analytics To Scale
Scott Hoover
Abstract

Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data in (near) real time. However, for the analyst or data scientist who builds models offline, integrating their analyses into these pipelines for operational purposes can pose a challenge.

In this workshop, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating analytical, statistical and machine learning solutions: from collection and storage to analysis and real-time predictions.
Agenda

● Introduction
● What Are We Talking About, Exactly?
● The Problem at Hand
● Operationalizing Analytics
● Operationalizing Predictive Analytics
● Questions
Introduction

● I work on the Internal Data team at Looker.
● Before Looker, I worked in consulting and research.
● Looker is a business intelligence tool.
What Are We Talking About, Exactly?

● What do I mean when I say “operationalizing”?
● Why is this important?
The Problem at Hand

● Analysts are providing basic reports for the entire business.
● Analysts and Data Scientists are building offline models.
The Problem With Offline Models

● Offline analyses have slow turnaround times.
● Offline analyses aren’t particularly collaborative.
● Offline analyses aren’t particularly portable.
A Potential Set-up (Straw Man)

[Diagram: Data Sources → (HTTP) → Data Stores → (query) → Analysis / Consumption]
Operationalizing Analytics - The Simple Case

● These metrics are vanilla.
● These metrics are critical.
● The business would probably be better served if Data Scientists and Analysts were spending their time answering questions that require deep technical knowledge.
Operationalizing Analytics - A How To

● Build or buy a workhorse ETL tool.
● Move toward an Operational Data Store (ODS), reducing the need for postprocessing and data “mashups.”
● Emphasize self-service wherever possible.
● Analytics should slot into the existing infrastructure with minimal friction.
Operationalizing Predictive Analytics
Where to Begin

● Out-of-the-box tools.
● Build from scratch.
● A mean between extremes.
A Model Standard - PMML

● An XML-based model-storage format.
● Created and maintained by the Data Mining Group.
● Most commonly used statistical/machine learning models are supported.
PMML Integrations - Producers and Consumers
JPMML

● JPMML is an open-source API for evaluating PMML files.
● In essence, we equip the JPMML application with our PMML file, serve it up with new data, and it provides us with predictions.
● Openscoring.io distributes various JPMML APIs and UDFs: for example, RESTful API, Heroku, Hive, Pig, Cascading and PostgreSQL.
● All we have to do is write some code that fetches new values, serves them up to the JPMML API, captures the predictions, then pushes them back to a database.
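That fetch-score-write-back loop can be sketched in a few lines of Python. This is a minimal sketch, not the talk's actual code: the endpoint URL, field names, and the `build_score_request`/`score_lead` helpers are assumptions modeled on the Openscoring request format shown later in the deck.

```python
import json
import urllib.request

# Hypothetical Openscoring endpoint; substitute your own deployment.
OPENSCORING_URL = "http://ec2_endpoint/openscoring/model/BayesLeadScore"

def build_score_request(lead_id, country, budget):
    """Build the JSON body the Openscoring API expects for a single lead."""
    return {"id": lead_id, "arguments": {"country": country, "budget": budget}}

def score_lead(lead_id, country, budget):
    """POST one lead to the scoring server and return the parsed response."""
    body = json.dumps(build_score_request(lead_id, country, budget)).encode("utf-8")
    req = urllib.request.Request(
        OPENSCORING_URL,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# In a real pipeline, the probabilities returned by score_lead() would then
# be written back to the leads table in the database.
```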
Example Architecture - Lead Scoring

[Diagram: GET leads from the application API → POST to the scoring API → UPDATE each lead with its prediction]
Deploy Model - PUT /model/${id}

Heroku: git push heroku master

REST:
curl -X PUT \
  --data-binary @BayesLeadScore.pmml \
  -H "Content-type: text/xml" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore
View Model - GET /model/${id}

CURLing or navigating to http://heroku_endpoint/openscoring/model/BayesLeadScore or http://ec2_endpoint/openscoring/model/BayesLeadScore will display our PMML model.
Test Model - POST /model/${id}

newLead.json:
{
  "id": "001",
  "arguments": {
    "country": "US",
    "budget": 7.8
  }
}

Send request to JPMML API:
curl -X POST \
  --data-binary @newLead.json \
  -H "Content-type: application/json" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore
Example Response

{
  "id": "001",
  "result": {
    "meeting": "1",
    "Probability_0": 0.33062906130485653,
    "Probability_1": 0.6693709386951435
  }
}
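Capturing the prediction from a response like the one above is plain JSON parsing. A sketch in Python, using the field names from the example response (`meeting`, `Probability_1`):

```python
import json

# The example response from the scoring server, verbatim.
response_text = """
{"id": "001",
 "result": {"meeting": "1",
            "Probability_0": 0.33062906130485653,
            "Probability_1": 0.6693709386951435}}
"""

resp = json.loads(response_text)
predicted_class = resp["result"]["meeting"]      # the predicted label
p_meeting = resp["result"]["Probability_1"]      # probability the lead books a meeting
```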
Batch Request - POST /model/${id}/batch

batchLeads.json:
{
  "id": "batch-1",
  "requests": [
    { "id": "001", "arguments": { "country": "US", "budget": 7.8 } },
    { "id": "002", "arguments": { "country": "CA", "budget": 3.2 } }
  ]
}

Send request to JPMML API:
curl -X POST \
  --data-binary @batchLeads.json \
  -H "Content-type: application/json" \
  http://ec2_endpoint/openscoring/model/BayesLeadScore/batch
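In practice the batch body would be assembled from rows fetched out of the leads table rather than written by hand. A sketch of that assembly step (the `build_batch_request` helper and the tuple layout are assumptions for illustration):

```python
def build_batch_request(batch_id, leads):
    """Assemble the Openscoring batch body: one sub-request per lead row.

    `leads` is a list of (lead_id, country, budget) tuples, e.g. rows
    fetched from the leads table.
    """
    return {
        "id": batch_id,
        "requests": [
            {"id": lead_id, "arguments": {"country": country, "budget": budget}}
            for lead_id, country, budget in leads
        ],
    }

# Reproduces the batchLeads.json payload shown above.
payload = build_batch_request("batch-1", [("001", "US", 7.8), ("002", "CA", 3.2)])
```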
Scale Considerations

● Horizontal scaling.
● Vertical scaling.
What About Truly Big Data?

● For the rare few of us who need to make real-time predictions against millions of rows per second, there’s a popular Apache suite to handle this.

*image borrowed from OryxProject
[Architecture diagram components: Applications, Transactional DB / Event Storage, ODS, Analysis, APIs, Scoring Server, Business Intelligence, Consumers, Review / Versioning]
Closing Thoughts
Questions
Learn more at looker.com/demo