Democratizing Data Science in the Enterprise

Page 1: Democratizing Data Science in the Enterprise

Democratizing Data Science in the Enterprise

Page 2: Democratizing Data Science in the Enterprise

Better Title: The NO BS Guide to Getting Insights from your Business Data

Page 3: Democratizing Data Science in the Enterprise

About Me

• Hackerpreneur

• Founder of Tellago

• Founder of KidoZen

• Board member

• Advisor: Microsoft, Oracle

• Angel Investor

• Speaker, Author

http://jrodthoughts.com

https://twitter.com/jrdothoughts

Page 4: Democratizing Data Science in the Enterprise

Agenda

• A brief history of data science

• Democratizing data science in the enterprise

• Building a great data science infrastructure

• Solving the last mile usability challenge

Page 5: Democratizing Data Science in the Enterprise

Key Takeaways

• How to build data science solutions in the real world without breaking the bank?

• What technologies can help?

• Myths and realities of data science solutions

Page 6: Democratizing Data Science in the Enterprise

Data Science….Still Magic?

Page 7: Democratizing Data Science in the Enterprise

It’s not a trick, it’s an illusion.

Page 8: Democratizing Data Science in the Enterprise

“Any sufficiently advanced technology is indistinguishable from magic.” (Arthur C. Clarke)

Page 11: Democratizing Data Science in the Enterprise

1. create technology: people who are not experts can use it easily with little difficulty and trust the output

2. make it “sufficiently advanced”

Page 12: Democratizing Data Science in the Enterprise

“data science”

d. conway, 2010

Page 13: Democratizing Data Science in the Enterprise

Basic Research: “Maybe someday, someone can use this.”

Applied Research: “I might be able to use this.”

Working Prototype: “I can use this (sometimes).”

Quality Code: “Software engineers can use this.”

Tool or Service: “People can use this.”

Page 14: Democratizing Data Science in the Enterprise

The Wizard….The Data Scientist

Page 16: Democratizing Data Science in the Enterprise

Fred Benenson (@fredbenenson):

IMHO the majority of data work boils down to 3 things:

1. Counting stuff

2. Figuring out the denominator

3. The reproducibility of 1 & 2

12:33 PM - 21 Aug 2013

Page 17: Democratizing Data Science in the Enterprise

They’re hot these days…

Page 22: Democratizing Data Science in the Enterprise

“data science”

jobs, jobs, jobs

Page 24: Democratizing Data Science in the Enterprise

Where do they come from?

Page 25: Democratizing Data Science in the Enterprise

“data science”

ancient history: 2001

Page 26: Democratizing Data Science in the Enterprise

“The Future of Data Analysis,” John W. Tukey, 1962

Page 27: Democratizing Data Science in the Enterprise

introduces:

“Exploratory data analysis”

Page 28: Democratizing Data Science in the Enterprise

Tukey 1965, via John Chambers

Page 29: Democratizing Data Science in the Enterprise

TUKEY BEGAT S WHICH BEGAT R

Page 30: Democratizing Data Science in the Enterprise

Tukey 1972

Page 31: Democratizing Data Science in the Enterprise

? 1972

Page 32: Democratizing Data Science in the Enterprise

Jerome H. Friedman

Page 33: Democratizing Data Science in the Enterprise

TUKEY BEGAT ESL

Page 34: Democratizing Data Science in the Enterprise

TUKEY BEGAT VDQI

Page 35: Democratizing Data Science in the Enterprise

Tukey 1977

Page 36: Democratizing Data Science in the Enterprise

TUKEY BEGAT EDA

Page 37: Democratizing Data Science in the Enterprise

fast forward -> 2001

Page 38: Democratizing Data Science in the Enterprise

Data Science in the Enterprise

Page 39: Democratizing Data Science in the Enterprise

Seems like magic…

Page 40: Democratizing Data Science in the Enterprise

But it boils down to 2 factors….

Page 41: Democratizing Data Science in the Enterprise

Data Science Success Factors in the Enterprise

• Building a great data science infrastructure

• Solving the last mile problem

Page 42: Democratizing Data Science in the Enterprise

Tricks to build a great data science infrastructure

Page 43: Democratizing Data Science in the Enterprise

Trick#1: Centralized Data Aggregation…

Page 44: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Correlate data from disparate data sources

• Enable a centralized data store for my enterprise

• Incorporate new information sources in an agile way

But…

• Traditional multi-dimensional data warehouses are difficult to modify

• They are designed around a specific set of questions (schema-first)

• It is challenging to incorporate semi-structured and unstructured data

Page 45: Democratizing Data Science in the Enterprise

Centralized Data Aggregation: Best Practices

• Implement an enterprise data lake

• Rely on big data DW platforms such as Apache Hive

• Use a federated architecture efficiently partitioned for different business units

• Establish SQL as the common query language

• Leverage in-memory computing to optimize query performance

Page 46: Democratizing Data Science in the Enterprise

Centralized Data Aggregation: Technologies & Vendors

Page 47: Democratizing Data Science in the Enterprise

Trick#2: Data Discovery…

Page 48: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Organically discover data sources relevant to my job

• Help others discover data more efficiently

• Collaborate with colleagues about specific data sources

But…

• Business users typically don’t have access to the data lake

• There is no corporate data repository

• There is no search and metadata repository

Page 49: Democratizing Data Science in the Enterprise

Data Discovery: Best Practices

• Implement a corporate data catalog

• The data catalog should be the user interface to interact with the corporate data lake

• Copy ideas from data catalogs on the internet

• Provide rich metadata experience in your data catalog

• Extend your data lake with search capabilities
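A data catalog with metadata and search can be sketched in a few lines. This is a minimal, in-memory illustration only: the `CatalogEntry` and `DataCatalog` names, the metadata fields, and the sample data sources are all assumptions made for the example, not part of any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data source registered in the (hypothetical) corporate catalog."""
    name: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """In-memory catalog: register sources, then search their metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of entries whose name, description, or tags match."""
        needle = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, entry.description, *entry.tags]
            if any(needle in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = DataCatalog()
catalog.register(CatalogEntry("crm_accounts", "sales-ops",
                              "Customer accounts exported from the CRM",
                              {"customers", "sales"}))
catalog.register(CatalogEntry("web_clicks", "marketing",
                              "Raw clickstream events from the website",
                              {"events", "marketing"}))

print(catalog.search("customer"))
```

A real catalog would sit in front of the data lake and index far richer metadata (lineage, freshness, usage), but the shape is the same: register once, search by what people remember.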

Page 50: Democratizing Data Science in the Enterprise

Data Discovery: Technologies & Vendors

Page 51: Democratizing Data Science in the Enterprise

Trick#3: Establish a Common Query Language…

Page 52: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Query data from different business systems in a consistent way

• Correlate information from different line of business systems

• Reuse queries as new sources of information

But…

• Different business systems use different protocols to query data

• I need to learn a new query language to interact with my big data infrastructure

• Queries over large data sources can be SLOW

Page 53: Democratizing Data Science in the Enterprise

Query Language: Best Practices

• Standardize on SQL as the language to query business data

• Implement a SQL interface for your data lake

• Correlate data sources using simple SQL joins

• Materialize query results in your data lake for future reuse

• Invest in in-memory technologies to optimize performance
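The join-and-materialize practices above can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: an in-memory SQLite database stands in for the data lake's SQL interface, and the `crm_customers`, `erp_orders`, and `revenue_by_region` tables are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the data lake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE erp_orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
    INSERT INTO crm_customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO erp_orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Correlate the two sources with a plain SQL join and materialize the
# result as a new table so later queries can reuse it.
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS revenue
    FROM crm_customers AS c
    JOIN erp_orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)
```

The materialized table becomes a new data source in its own right, which is the "reuse queries as new sources of information" goal.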

Page 54: Democratizing Data Science in the Enterprise

Query Language: Technologies & Vendors

Page 55: Democratizing Data Science in the Enterprise

Trick#4: Focus on Data Quality…

Page 56: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Trust corporate data for my applications

• Actively merge new and historical data

• Integrate new data back into line of business systems

But…

• Data in line of business systems is poorly curated

• Some data records need to be validated or cleansed

• Some data records need to be enriched with additional data points

Page 57: Democratizing Data Science in the Enterprise

Data Quality: Best Practices

• Implement a data quality process

• Leverage your data catalog as the main user interface to control data quality

• Trust the wisdom of the crowds to manage data quality

• Provide a great user experience for data quality
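One step of a data quality process, cleansing and then validating records, might look like the sketch below. The field names, the sample records, and the deliberately simple email pattern are assumptions for illustration; real rules would come from the people who own the data.

```python
import re

# Deliberately simple check, not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def cleanse(record: dict) -> dict:
    """Trim whitespace on string fields and lowercase the email."""
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}
    if isinstance(cleaned.get("email"), str):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    if not EMAIL_RE.match(record.get("email") or ""):
        problems.append("invalid email")
    return problems


raw = [
    {"name": " Ada Lovelace ", "email": "ADA@Example.com "},
    {"name": "", "email": "not-an-email"},
]
# Pair each cleansed record with the problems found in it.
report = [(rec, validate(rec)) for rec in map(cleanse, raw)]
for rec, problems in report:
    print(rec, problems)
```

Surfacing the `problems` list in the data catalog is one way to let users see, and help fix, quality issues.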

Page 58: Democratizing Data Science in the Enterprise

Data Quality: Technologies & Vendors

Page 59: Democratizing Data Science in the Enterprise

Trick#5: Understand your data….

Page 60: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Execute efficient queries against my corporate data

• Discover patterns and trends about business data sources

• Rapidly adapt to new data sources added to our business processes

But…

• There is no simple way to understand corporate data sources

• We rely on users to determine which queries to execute

• New data patterns and trends often go undetected

Page 61: Democratizing Data Science in the Enterprise

Understanding your Data: Best Practices

• Leverage machine learning algorithms to understand business data sources

• Leverage clustering algorithms to detect interesting patterns from your business data

• Leverage classification algorithms to place data records in well-defined groups

• Leverage statistical distribution algorithms to reveal interesting information about your data
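As a small illustration of the statistical-distribution idea, the sketch below profiles one numeric column and flags values far from the mean. The `profile` helper, the z-score threshold of 2.0, and the `daily_orders` sample are assumptions for the example; it is a stand-in for, not a replacement of, proper clustering or classification tooling.

```python
import statistics


def profile(values: list[float], z_threshold: float = 2.0) -> dict:
    """Summarize a numeric column and flag values far from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    outliers = [v for v in values
                if stdev and abs(v - mean) / stdev > z_threshold]
    return {
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        "outliers": outliers,
    }


# Hypothetical column: orders per day, with one unusual spike.
daily_orders = [100, 102, 98, 101, 99, 103, 97, 250]
summary = profile(daily_orders)
print(summary)
```

Running a profile like this automatically on every new data source is one way to catch patterns that would otherwise depend on a user asking the right query.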

Page 62: Democratizing Data Science in the Enterprise

Understanding your Data: Technologies & Vendors

Page 63: Democratizing Data Science in the Enterprise

Trick#6: Predict…

Page 64: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Efficiently predict well-known variables in my business data

• Adapt results to future predictions

• Take actions based on the predicted outcomes

But…

• Our analytics are based on after-the-fact reports

• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data

• Traditional predictive analytics require complex infrastructure

Page 65: Democratizing Data Science in the Enterprise

Predict: Best Practices

• Implement a modern predictive analytics platform

• Leverage the data lake as the main source of information to predictive analytics algorithms

• Leverage classification and clustering algorithms as the main mechanisms to train predictions

• Expose predictions to other applications for future reuse

Page 66: Democratizing Data Science in the Enterprise

Predict: Technologies & Vendors

Page 67: Democratizing Data Science in the Enterprise

Trick#7: Take Actions…

Page 68: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Take actions on my business data without having to read a report

• Model automatic actions based on well-defined data rules

• Evaluate the effectiveness of the rules and adapt

But…

• Data results are mostly communicated via reports and dashboards

• There is no interface to design rules against business data

• Actions are implemented based on human interpretation of data

Page 69: Democratizing Data Science in the Enterprise

Take Actions: Best Practices

• Implement a modern predictive analytics platform

• Leverage the data lake as the main source of information to predictive analytics algorithms

• Leverage classification and clustering algorithms as the main mechanisms to train predictions

• Expose predictions to other applications for future reuse
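The goal of modeling automatic actions on well-defined data rules can be sketched as a list of condition/action pairs evaluated against each record. The `Rule` class, the inventory fields, and the two sample rules are hypothetical; in practice the actions would call into business systems rather than return strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A named condition on a record plus the action to take when it holds."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]


def apply_rules(record: dict, rules: list[Rule]) -> list[str]:
    """Run every rule whose condition holds and collect the action results."""
    return [rule.action(record) for rule in rules if rule.condition(record)]


# Hypothetical inventory rules; actions return a description of what to do.
rules = [
    Rule("low_stock",
         lambda r: r["stock"] < r["reorder_point"],
         lambda r: f"reorder {r['sku']}"),
    Rule("stale_price",
         lambda r: r["days_since_price_update"] > 90,
         lambda r: f"review price of {r['sku']}"),
]

actions = apply_rules(
    {"sku": "A-100", "stock": 3, "reorder_point": 10,
     "days_since_price_update": 12},
    rules,
)
print(actions)
```

Logging which rules fired, and what happened afterwards, gives the raw material for evaluating the rules and adapting them.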

Page 70: Democratizing Data Science in the Enterprise

Take Actions: Technologies & Vendors

Page 71: Democratizing Data Science in the Enterprise

Trick#8: Embrace developers…

Page 72: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Leverage data analyses in new applications

• Help developers embrace corporate data infrastructure

• Expose data analyses to new mediums such as mobile or IoT

But…

• Data results are mostly communicated via reports and dashboards

• Data analysis efforts are typically led by non-developers

• There is no easy way to organically discover and reuse corporate data sources

Page 73: Democratizing Data Science in the Enterprise

Leverage Developers: Best Practices

• Expose data sources and analyses via APIs

• Leverage industry standards to integrate with third-party tools

• Provide data access samples and SDKs for different environments such as mobile and IoT clients

• Incorporate developers’ feedback into your data sources
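Exposing a data source through an API can be as small as the sketch below, which uses only Python's standard library. The `/datasets/<name>` URL scheme, the `DATASETS` dictionary, and the `handle` helper are assumptions made for the example; a production API would add authentication, paging, and versioning.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical analysis results that should be exposed to developers.
DATASETS = {"revenue_by_region": [{"region": "East", "revenue": 150.0}]}


def handle(path: str) -> tuple[int, str]:
    """Map a request path such as /datasets/<name> to (status, JSON body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "datasets" and parts[1] in DATASETS:
        return 200, json.dumps(DATASETS[parts[1]])
    return 404, json.dumps({"error": "unknown dataset"})


class DatasetHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle(); only GET is supported here."""

    def do_GET(self) -> None:
        status, body = handle(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


print(handle("/datasets/revenue_by_region"))
```

To try it locally, the handler can be served with `http.server.HTTPServer(("127.0.0.1", 8000), DatasetHandler).serve_forever()`. Keeping the routing in a plain function makes it easy to test without starting a server.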

Page 74: Democratizing Data Science in the Enterprise

Leverage Developers: Technologies & Vendors

Page 75: Democratizing Data Science in the Enterprise

Trick#9: Real time data is different…

Page 76: Democratizing Data Science in the Enterprise

Goals & Challenges

I would like to…

• Process large volumes of real time data

• Aggregate real time and historical data

• Detect and filter conditions in my real time data before it goes into corporate systems

But…

• There is no infrastructure to query real time data

• We process real time and historical data using the same models

• Large data volumes affect performance

Page 77: Democratizing Data Science in the Enterprise

Real Time Data Processing: Best Practices

• Implement a stream analytics platform

• Model queries over real time data streams

• Add the results of the aggregated queries into the data lake

• Replay data streams to simulate real time conditions
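Queries over a stream and replaying recorded events can be illustrated with a fixed-window average. This sketch makes several assumptions: events are `(timestamp, sensor, reading)` tuples, windows are non-overlapping ("tumbling"), and the whole replayed stream is consumed before results are returned, whereas a real stream analytics platform would emit each window as it closes.

```python
from collections import defaultdict
from typing import Iterable, Iterator

Event = tuple[int, str, float]  # (timestamp in seconds, sensor id, reading)


def replay(events: Iterable[Event]) -> Iterator[Event]:
    """Replay recorded events in timestamp order to simulate a live stream."""
    yield from sorted(events, key=lambda e: e[0])


def tumbling_average(stream: Iterable[Event], window_seconds: int) -> dict:
    """Average reading per (window start, sensor) over fixed windows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, sensor, reading in stream:
        # Round the timestamp down to the start of its window.
        key = (ts - ts % window_seconds, sensor)
        sums[key] += reading
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical recorded events, deliberately out of order.
recorded = [(65, "s1", 30.0), (5, "s1", 10.0), (20, "s1", 20.0), (70, "s2", 5.0)]
aggregates = tumbling_average(replay(recorded), window_seconds=60)
print(aggregates)
```

The `aggregates` dictionary is the kind of compact result that would be written back into the data lake, instead of the raw events.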

Page 78: Democratizing Data Science in the Enterprise

Real Time Data Processing: Technologies & Vendors

Page 79: Democratizing Data Science in the Enterprise

Solving the last mile problem

Page 80: Democratizing Data Science in the Enterprise

Trick#1: Killer user experience…

Page 81: Democratizing Data Science in the Enterprise

Create a Killer User Experience

• Design matters

• Invest in an easy way for users to interact with corporate data sources

• Leverage modern UX principles that work across channels (mobile, web)

• Make data discoverable

• Leverage metadata

• Facilitate collaboration

Page 82: Democratizing Data Science in the Enterprise

Trick#2: Test test test…

Page 83: Democratizing Data Science in the Enterprise

Test Test Test

• Incorporate test models into your data sources

• Simulate real world conditions at the data level

• Assume everything will fail

Page 84: Democratizing Data Science in the Enterprise

Trick#3: Integrate with existing tools…

Page 85: Democratizing Data Science in the Enterprise

Integrate with Third Party Tools

• Integrate your data lake with mainstream tools like Tableau or Excel

• Use industry standards so that data sources can be incorporated

Page 86: Democratizing Data Science in the Enterprise

Trick#5: Collaborate…

Page 87: Democratizing Data Science in the Enterprise

Collaborate

• Integrate data sources with modern messaging and collaboration tools: Slack, Yammer, etc.

• Distribute updates via email, push notifications, SMS

Page 88: Democratizing Data Science in the Enterprise

Other things to consider

• On-premise, cloud or hybrid?

• Apply agile development practices to your data science infrastructure

• Infrastructure is cool but usability is more important

Page 89: Democratizing Data Science in the Enterprise

Summary

• Data science is not magic, it’s an illusion

• Implementing data science in the enterprise is about solving two problems:

  • Building a great data infrastructure

  • Solving the last mile usability challenge

• Today this can be done with commodity technology

• Data scientists are just “people” ;)

Page 90: Democratizing Data Science in the Enterprise

THANKS

Jesus Rodriguez

https://twitter.com/jrdothoughts

http://jrodthoughts.com/