Winning With Big Data: Secrets of the Successful Data Scientist

The world is experiencing an Industrial Revolution of Data. In any given minute the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And increasingly this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights. Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.

WINNING WITH BIG DATA

Secrets of the Successful Data Scientist

Michael Driscoll, @dataspora

SDForum BI SIG, June 15, 2010

WHY DATA MATTERS NOW

THE INDUSTRIAL AGE OF DATA

WHAT IS BIG DATA?

Data that is distributed.

class  | size         | manage with                  | how it fits                  | examples
small  | < 10 GB      | Excel, R                     | fits in one machine’s memory | thousands of sales figures
medium | 10 GB - 1 TB | indexed files, monolithic DB | fits on one machine’s disk   | millions of web pages
big    | > 1 TB       | Hadoop, distributed DBs      | stored across many machines  | billions of web clicks

WHAT IS DATA SCIENCE?

WHY DATA SCIENCE IS SEXY

“The sexy job in the next ten years will be statisticians…” - Hal Varian

data: 1000 bytes

model: 2 bytes

9 WAYS TO WIN WITH DATA

1. CHOOSE THE RIGHT TOOL

You don’t need a chainsaw to cut butter.

2. COMPRESS EVERYTHING

The world is IO-bound.

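# Dump the source DB, compress the stream, ship it over ssh, then decompress and load it remotely.
# Note: -p takes the password with no space; "-p mypass" would prompt and treat mypass as a database name.
mysqldump -u myuser -pmypass sourceDB | gzip | \
  ssh mike@dataspora.com "gunzip | mysql -u myuser -pmypass targetDB"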
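To put a rough number on what compression buys, here is a minimal Python sketch. It assumes hypothetical, highly repetitive CSV-like rows invented for illustration; the helper name `gzip_ratio` is not from the talk, and real ratios depend entirely on the data.

```python
import gzip


def gzip_ratio(raw: bytes) -> float:
    """Return compressed size divided by raw size (smaller is better)."""
    return len(gzip.compress(raw)) / len(raw)


# Hypothetical click-log rows, made up for this sketch.
rows = "".join(f"{i},user{i % 50},click\n" for i in range(10_000)).encode("utf-8")

ratio = gzip_ratio(rows)
# Round trip: decompressing must give back exactly the original bytes.
restored = gzip.decompress(gzip.compress(rows))

print(f"compressed to {ratio:.1%} of original size")
```

Fewer bytes on disk or on the wire means less time waiting on IO, at the cost of some CPU on each end.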

3. SPLIT UP YOUR DATA

Split, apply, combine.
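One way to read "split, apply, combine" is sketched below in plain Python. The function name `split_apply_combine` and the sales records are hypothetical, chosen only to show the three steps; on big data each group could be handled on a different machine.

```python
from collections import defaultdict


def split_apply_combine(records, key, apply):
    """Group records by key, run `apply` on each group, and collect the results."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)                 # split
    return {k: apply(v) for k, v in groups.items()}  # apply, then combine


# Hypothetical (region, amount) sales records.
sales = [("west", 10.0), ("east", 4.0), ("west", 6.0), ("east", 1.0)]

totals = split_apply_combine(
    sales,
    key=lambda rec: rec[0],
    apply=lambda rows: sum(amount for _, amount in rows),
)
print(totals)
```

Because each group is processed independently, the apply step is the part that parallelizes.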

4. WORK WITH SAMPLES

Big Data is heavy, samples are light.

perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
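The same 1% sampling idea can be sketched in Python. The `seed` parameter is an addition not present in the one-liner above, included so a sample can be reproduced; `sample_lines` and the generated rows are hypothetical names and data.

```python
import random


def sample_lines(lines, rate=0.01, seed=None):
    """Yield each line independently with probability `rate`."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < rate:
            yield line


# Hypothetical input; with a real file, pass the open file object instead.
lines = [f"row{i}\n" for i in range(100_000)]

sample = list(sample_lines(lines, rate=0.01, seed=42))
print(len(sample), "of", len(lines), "lines kept")
```

The generator streams one line at a time, so the full data set never has to fit in memory.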

5. USE STATISTICS

6. COPY FROM OTHERS

Use open source.

git clone git://github.com/kevinweil/hadoop-lzo

7. ESCHEW CHART TYPOLOGIES

Charts are compositions, not containers.

8. COLOR WITH CARE

Color can enhance or insult.

9. TELL A STORY

People are listening.

ONE SUCCESS STORY

WHY DO TELCO CUSTOMERS LEAVE?

Sign up Leave

Goal: “less churn.”

DATA: BILLIONS OF CALLS

… and millions of callers.

DOES CALL QUALITY MATTER?

… a difference, but not significant.

Hmmm...

WHAT ABOUT SOCIAL NETWORKS?

… but is it predictive?

BUILD THE CALL GRAPH

EVOLUTION OF A CALL GRAPH: April, May, June, July

700% INCREASE IN CHURN

when a cancellation occurs in a call network.
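A comparison of this kind could be set up roughly as follows. This is a toy sketch, not the analysis from the talk: the names, call pairs, and cancellation sets are invented, and `churn_rate_by_exposure` is a hypothetical helper. It builds an undirected call graph, then compares later churn among customers who did or did not have a contact cancel earlier.

```python
from collections import defaultdict


def churn_rate_by_exposure(calls, earlier, later):
    """Return {exposed: churn rate}, where exposed means a contact cancelled earlier."""
    neighbors = defaultdict(set)
    for a, b in calls:              # build an undirected call graph
        neighbors[a].add(b)
        neighbors[b].add(a)

    tallies = {True: [0, 0], False: [0, 0]}  # exposed -> [churned, total]
    for person, contacts in neighbors.items():
        if person in earlier:       # already gone, not at risk in the later period
            continue
        exposed = not contacts.isdisjoint(earlier)
        tallies[exposed][0] += person in later
        tallies[exposed][1] += 1
    return {exposed: (c / n if n else 0.0) for exposed, (c, n) in tallies.items()}


# Invented toy data: who called whom, and who cancelled in each period.
calls = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("dan", "eve"), ("eve", "fay"), ("fay", "gus"), ("dan", "gus")]
earlier = {"ann"}
later = {"bob", "gus"}

rates = churn_rate_by_exposure(calls, earlier, later)
print(rates)
```

On real data the two rates would be computed month over month across millions of callers, and the gap between them would still need a significance check.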

FINAL THOUGHTS

THE BIG DATA STACK

Big Data Dedicated RDBMS

Analytics (R, SPSS, SAS, SAP)

Data Products (Content Filters, Rec Engines)

Data, Actions, Insights

THANKS! QUESTIONS?

Michael Driscoll, med@dataspora.com

@dataspora on Twitter, http://www.dataspora.com/blog
