Design Challenges for Real Predictive Platforms Max Gasner @gasnerpants

Strata 2014: Design Challenges for Real Predictive Platforms

DESCRIPTION

The first databases were tightly coupled to their implementation details and use cases, until the relational revolution opened up the field and made database systems flexible enough to support a wide variety of applications with minimal configuration. What will it take to make predictive systems as ubiquitous and easy to use as databases? We’ll discuss the crucial design criteria for future predictive platforms and the kinds of interfaces they need to be able to support, as well as the challenges that lie between the state of the art and the future we envision.

Page 1: Strata 2014: Design Challenges for Real Predictive Platforms

Design Challenges for Real Predictive Platforms

Max Gasner

@gasnerpants

Page 2: Strata 2014: Design Challenges for Real Predictive Platforms

What I’ve Been Doing

- Navia Systems (2010-2011)
  - probabilistic programming
- Prior Knowledge (2011-2012)
  - the Veritable API (went live in July 2012)
- acquired by salesforce.com (2012–2014)
  - predictive intelligence team

APIs and cloud services to expose nonparametric Bayes (!?) … to a general audience (?!)

Page 3: Strata 2014: Design Challenges for Real Predictive Platforms

Predictive Platforms?

- Methods have advanced to support flexible use
- Market is getting there: lots of data, many frustrated business users
- Let’s not mistake specialist problems for the wider need
- For most business problems, it’s a cold start and the competition is no predictive solution

Page 4: Strata 2014: Design Challenges for Real Predictive Platforms

The Database Analogy

“Just” deterministic storage, collation, query, sorting, aggregation

Early database systems were purpose-built by consultants and tied tightly to implementation details

RETRIEVE FORCE STATUS WITH RUNWAY LENGTH > 8000, GCD(DENVA)>GCD(DENVA,BEVENS) THEN LIST AFLD NAME, GCD, RUNWAY LENGTH (System 473L, 1966)

Relational database revolution (largely) decoupled schema from storage, interface from implementation

Page 5: Strata 2014: Design Challenges for Real Predictive Platforms

The Database Analogy

Indexing / Query

Ingest from many sources; data is typed

Sensible defaults, but highly configurable and extensible by experts

Flexible query; some queries will fail; sensible defaults, but highly configurable and extensible by experts

Many clients; databases outlive initial applications

Modeling “Prediction”

Page 6: Strata 2014: Design Challenges for Real Predictive Platforms

Database Lessons

- “Decouple” implementation so users can be productive at different levels of abstraction
- Extensive gains (more applications are possible) and intensive gains (applications are easier to develop and maintain)
- Quantity >> quality: more is much more than better

Page 7: Strata 2014: Design Challenges for Real Predictive Platforms

Database Lessons

“Everyone” writes SQL:

SELECT * FROM Patients WHERE Icd9 LIKE '250' AND DischargeDate = '2014-02-11'

It needs to be this easy to deploy and query models:

INFER WillReadmit FROM (SELECT * FROM Patients WHERE Icd9 LIKE '250' AND DischargeDate = '2014-02-11')
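As a rough illustration of the INFER-over-SELECT idea, here is a minimal sketch using Python's standard sqlite3 module. Everything in it is an assumption made for illustration: the table contents, the `infer_will_readmit` helper, and the toy "model" (the historical readmission rate among patients with the same Icd9 code). It is not how any real predictive platform implements INFER.

```python
import sqlite3

# Hypothetical Patients table; WillReadmit is NULL where the outcome is unknown.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patients (Id INTEGER, Icd9 TEXT, DischargeDate TEXT, WillReadmit INTEGER)"
)
conn.executemany(
    "INSERT INTO Patients VALUES (?, ?, ?, ?)",
    [
        (1, "250", "2014-01-05", 1),
        (2, "250", "2014-01-20", 0),
        (3, "250", "2014-01-28", 1),
        (4, "410", "2014-01-30", 0),
        (5, "250", "2014-02-11", None),
        (6, "410", "2014-02-11", None),
    ],
)


def infer_will_readmit(conn, where_sql, params):
    """Toy stand-in for INFER: for each selected row, return the observed
    readmission rate (and the number of supporting rows) among patients
    with the same Icd9 code whose outcome is known.

    where_sql is interpolated directly, so it must come from trusted code.
    """
    selected = conn.execute(
        f"SELECT Id, Icd9 FROM Patients WHERE {where_sql}", params
    ).fetchall()
    results = {}
    for patient_id, icd9 in selected:
        rate, support = conn.execute(
            "SELECT AVG(WillReadmit), COUNT(WillReadmit) FROM Patients "
            "WHERE Icd9 = ? AND WillReadmit IS NOT NULL",
            (icd9,),
        ).fetchone()
        results[patient_id] = (rate, support)
    return results


predictions = infer_will_readmit(
    conn, "Icd9 LIKE ? AND DischargeDate = ?", ("250", "2014-02-11")
)
print(predictions)
```

The caller writes one ordinary filter and gets a prediction per selected row, along with how many historical rows supported it.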

Page 8: Strata 2014: Design Challenges for Real Predictive Platforms

Desiderata for Real Platforms

- Be robust to the real world: data is messy, users are inexperienced, and problems are underspecified
- Be honest about limitations: fail gracefully, but always fail when to do otherwise would be misleading
- Be flexible to changes in the structure of data and the questions that matter
- Be simple to use and provide the basic building blocks for complex applications (but don’t try to solve language, vision, and dating)

Page 9: Strata 2014: Design Challenges for Real Predictive Platforms

Robust

- Far more data is usually available than is understood
- Every dataset has missing values
- Every value is noisy

- Systems shouldn’t fail in the presence of irrelevant or partially observed data
- Systems should be conservative in the face of uncertainty
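One way to read "don't fail on partial data, but be conservative" is sketched below. The function name `robust_estimate` and the `min_observed` threshold are hypothetical choices for illustration; the estimator skips missing values instead of raising, and declines to answer when too few values were actually observed.

```python
from math import sqrt
from statistics import mean, stdev


def robust_estimate(values, min_observed=3):
    """Estimate a column mean from partially observed data.

    Missing values (None) are ignored rather than causing a failure.
    If fewer than min_observed values are present, return None instead
    of a number that would look more certain than it is.
    Otherwise return (mean, standard error of the mean).
    """
    observed = [v for v in values if v is not None]
    if len(observed) < min_observed:
        return None
    return mean(observed), stdev(observed) / sqrt(len(observed))


print(robust_estimate([1.0, None, 3.0]))       # too few observations
print(robust_estimate([2.0, None, 4.0, 6.0]))  # enough observations
```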

Page 10: Strata 2014: Design Challenges for Real Predictive Platforms

Honest

- No system is adequate to every problem or dataset
- Some mistakes are expensive and some are cheap
- Black boxes are easy to use and hard to trust

- Systems should provide measures of uncertainty
- Systems should explain their reasoning (in the sense of EXPLAIN)

Page 11: Strata 2014: Design Challenges for Real Predictive Platforms

Flexible

- The world isn’t a real-valued matrix
- Modeling choices shouldn’t mean we fake our datatypes
- The world is nonstationary and every predictive problem is streaming

- Systems should handle heterogeneous data natively
- Systems should retrain (and validate) continuously
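Continuous retraining and validation can be sketched with a "predict, then update" loop, in which each incoming value is first used to score the current model and only then used to train it. The `RunningMean` model and the `prequential_abs_errors` helper are hypothetical names, and a running mean is only a stand-in for a real model.

```python
class RunningMean:
    """Trivial streaming model: predicts the mean of the values seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def update(self, x):
        # Incremental mean update; no need to store past values.
        self.n += 1
        self.mean += (x - self.mean) / self.n


def prequential_abs_errors(stream):
    """Validate and retrain on every value: record the absolute error of the
    prediction made before seeing x, then update the model with x."""
    model = RunningMean()
    errors = []
    for x in stream:
        errors.append(abs(x - model.predict()))
        model.update(x)
    return model, errors


model, errors = prequential_abs_errors([2.0, 4.0, 6.0])
print(model.mean, errors)
```

Because every value is scored before it is learned from, the error series doubles as an ongoing validation signal on a nonstationary stream.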

Page 12: Strata 2014: Design Challenges for Real Predictive Platforms

Simple

- Predictive systems need to be easy to engineer with
- And they need to be easy to engineer
- The business user and the modeler both have valid interests in a predictive system and both need to be able to use it

- Systems should be decoupled into pieces
- Systems should expose a small set of operations that can compose to form complex predictive systems

Page 13: Strata 2014: Design Challenges for Real Predictive Platforms

Case Study: BayesDB

- Built on flexible general model for denormalized flat data tables*
- Separates index backend(s) from query frontend
- Exposes query interface through SQL-like language, “BQL”
- Open-source project (looking for hackers)
- http://probcomp.csail.mit.edu/bayesdb

*Exercise for the reader: what about relational, graph, free text, and time series data?

@vmansinghka

Page 14: Strata 2014: Design Challenges for Real Predictive Platforms

Building Blocks

- ANALYZE: Construct models (like table views) from the dataset (table)
- SIMULATE: Generate new (unobserved) rows like those in the table
- INFER: Fill in “missing” (or target) values for partially-observed rows
- ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES (!!): Exposes the structure of the learned model
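To make the first three building blocks concrete, here is a toy sketch in Python. It is emphatically not BayesDB's model or API: the "model" is just the empirical joint distribution over complete rows, and the column names, the sample table, and the `analyze`, `simulate`, and `infer` functions are all hypothetical.

```python
import random
from collections import Counter

COLUMNS = ("smoker", "diabetic", "readmitted")


def analyze(table):
    """Toy ANALYZE: the empirical joint distribution over complete rows."""
    complete = [
        tuple(row[c] for c in COLUMNS)
        for row in table
        if all(row.get(c) is not None for c in COLUMNS)
    ]
    counts = Counter(complete)
    total = sum(counts.values())
    return {row: n / total for row, n in counts.items()}


def simulate(model, n, rng):
    """Toy SIMULATE: draw n new rows from the modeled joint distribution."""
    rows = list(model)
    weights = [model[r] for r in rows]
    return [dict(zip(COLUMNS, r)) for r in rng.choices(rows, weights=weights, k=n)]


def infer(model, partial, target):
    """Toy INFER: most probable value of target given the observed columns,
    with its conditional probability. Returns None if nothing matches."""
    idx = COLUMNS.index(target)
    scores = Counter()
    for row, p in model.items():
        matches = all(
            partial.get(c) is None or partial[c] == v
            for c, v in zip(COLUMNS, row)
            if c != target
        )
        if matches:
            scores[row[idx]] += p
    total = sum(scores.values())
    if total == 0:
        return None
    value, mass = scores.most_common(1)[0]
    return value, mass / total


table = [
    {"smoker": True, "diabetic": True, "readmitted": True},
    {"smoker": True, "diabetic": True, "readmitted": True},
    {"smoker": True, "diabetic": False, "readmitted": False},
    {"smoker": False, "diabetic": True, "readmitted": True},
    {"smoker": False, "diabetic": False, "readmitted": False},
    {"smoker": False, "diabetic": False, "readmitted": False},
    {"smoker": True, "diabetic": None, "readmitted": True},  # incomplete row
]
model = analyze(table)
print(simulate(model, 3, random.Random(0)))
print(infer(model, {"smoker": True}, "readmitted"))
```

The same model object answers both generative and predictive questions, and `infer` accepts any combination of observed columns.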

Page 15: Strata 2014: Design Challenges for Real Predictive Platforms

Separate Concerns

- ANALYZE abstracts “what is doing the analysis”, decouples model choice, inference strategy, validation from query
- Enables heterogeneous backends, on-the-fly model selection, incremental model updates, cost-based modeling
- Analyses of the same data might treat it differently for different purposes
- Challenge: training the predictive DBA?
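The decoupling described here can be sketched as a small backend interface: the query side calls the same two functions no matter which backend did the analysis. The `Backend` protocol, the two trivial backends, and the `analyze` and `infer` functions below are hypothetical, chosen only to show the shape of the separation.

```python
from statistics import mean, median
from typing import Optional, Protocol, Sequence


class Backend(Protocol):
    """What the query frontend needs from any analysis backend."""

    def fit(self, values: Sequence[float]) -> None: ...

    def predict(self) -> float: ...


class MeanBackend:
    def fit(self, values: Sequence[float]) -> None:
        self.value = mean(values)

    def predict(self) -> float:
        return self.value


class MedianBackend:
    """Treats the same data differently: less sensitive to outliers."""

    def fit(self, values: Sequence[float]) -> None:
        self.value = median(values)

    def predict(self) -> float:
        return self.value


def analyze(column: Sequence[Optional[float]], backend: Backend) -> Backend:
    """Fit the chosen backend on the observed values of a column."""
    backend.fit([v for v in column if v is not None])
    return backend


def infer(model: Backend) -> float:
    """Query side: identical regardless of which backend was used."""
    return model.predict()


column = [1.0, 2.0, None, 100.0]
print(infer(analyze(column, MeanBackend())))
print(infer(analyze(column, MedianBackend())))
```

Swapping the backend changes the answer but not the query code, which is the property that makes on-the-fly model selection possible.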

Page 16: Strata 2014: Design Challenges for Real Predictive Platforms

Flexible Query

- SIMULATE exposes the joint distribution but no actual values (anonymization, synthetic data generation)
- INFER supports traditional single-valued prediction, but also joint prediction, conditioned on any combination of values
- Flexibility goes hand in hand with consistency: expect that the results will be consistent in distribution
- Challenge: exposing query to the interactive end user?

Page 17: Strata 2014: Design Challenges for Real Predictive Platforms

Structure Discovery

- ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES (eppdepp?) exposes the structure of the model
- Moving to broader measures of dependence than correlation
- Structure is key for iterative, exploratory workflows
- Structure feeds into optimization of query and inference strategies
- Challenge: representing structural uncertainty?
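A small example of why correlation is too narrow a measure of dependence: below, y is completely determined by x, yet their Pearson correlation is zero. Mutual information is used here only as a familiar stand-in for a broader dependence measure; it is not the dependence probability that BayesDB estimates, and the function names and data are illustrative assumptions.

```python
from collections import Counter
from math import log2
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences,
    computed from their empirical joint and marginal frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


xs = [-1, 0, 1] * 10
ys = [x * x for x in xs]  # y depends on x, but not linearly

print(pearson(xs, ys))
print(mutual_information(xs, ys))
```

The correlation is zero while the mutual information is clearly positive, so a correlation-only view would miss this relationship entirely.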

Page 18: Strata 2014: Design Challenges for Real Predictive Platforms

Expose Uncertainty

- Values should come with error bars, and explanations of how they were derived
- Automated systems can use uncertainty to make cost-benefit tradeoffs (do show this ad, but don’t let this patient be discharged without this test being reviewed)
- Uncertainty lets us move beyond anecdotes
- Challenge: uncertainty is hard to understand and explain
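The cost-benefit tradeoff can be sketched as a comparison of expected costs. The function names, probabilities, and cost figures below are made up for illustration; the point is only that a predicted probability, unlike a bare yes/no label, lets an automated system weigh cheap mistakes against expensive ones.

```python
def expected_costs(p_positive, cost_missed_positive, cost_false_alarm):
    """Expected cost of acting vs. skipping, given the predicted probability
    that the event of interest (a click, a readmission) will occur."""
    return {
        "act": (1 - p_positive) * cost_false_alarm,
        "skip": p_positive * cost_missed_positive,
    }


def choose(p_positive, cost_missed_positive, cost_false_alarm):
    """Pick whichever action has the lower expected cost."""
    costs = expected_costs(p_positive, cost_missed_positive, cost_false_alarm)
    return min(costs, key=costs.get)


# Showing an ad: a wasted impression is cheap, so act even at low probability.
print(choose(p_positive=0.02, cost_missed_positive=1.0, cost_false_alarm=0.01))

# Reviewing a test before discharge: a missed problem is very expensive.
print(choose(p_positive=0.05, cost_missed_positive=10_000.0, cost_false_alarm=50.0))

# Same costs, much lower risk: the review is no longer worth it.
print(choose(p_positive=0.001, cost_missed_positive=10_000.0, cost_false_alarm=50.0))
```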

Page 19: Strata 2014: Design Challenges for Real Predictive Platforms

Hard Problems: Getting Data In

- Source vs. Dataset (vs. Transformation, vs. Multi-Dataset, …)
- Can we add more semantics to data definitions and schemas, to leverage our prior knowledge?
- Can we use cloud services or crowdsourcing to better transform and interpret inputs?
- We need to design the entire data collection and storage pipeline to better support analytic consumers of data

Page 20: Strata 2014: Design Challenges for Real Predictive Platforms

Hard Problems: Getting Results Out

- Not clear what the right default presentation is
- Much work to be done in exposing and explaining predicted values and uncertainty
- As predictive systems start to support UIs (beyond News Feed), we need to design new paradigms for interaction with imputed and uncertain values
- It’s hard to form mental models of reactive/adaptive systems

Page 21: Strata 2014: Design Challenges for Real Predictive Platforms

Hard Problems: PL/BQL?

- The holy grail of “custom data types”: columns with custom models written in probabilistic programming languages
- Think ICD9: we have a really strong prior (medicine + biology) but no way to express it, let alone do inference
- Put domain-specific modeling in the hands of “anyone”
- How many people have written some PL/SQL? How many people have written production database internals?

Page 22: Strata 2014: Design Challenges for Real Predictive Platforms

Predictive in the Ecosystem

- Today: many specialized views of data (extending the basic OLAP/OLTP distinction for a new era of bigger data and new demands)
- Tomorrow: predictive services as true services inter pares, with many clients of their own, deriving data from the same sources of truth as other services
- Lots of work to do flowing provenance, prior knowledge, and schemata through the entire pipeline

Page 23: Strata 2014: Design Challenges for Real Predictive Platforms

Ecosystem?

- Let a hundred flowers blossom, let a hundred general purpose predictive platforms contend
- Lots of uncertainty about the right (combination of) models to support the interface
- Lots of room to innovate on API and presentation
- Many problems in business very eager for credible solutions
- Database analogy: we are waiting for Codd

Page 24: Strata 2014: Design Challenges for Real Predictive Platforms

Thanks!

Max Gasner

@gasnerpants