
© 2014 The MathWorks, Inc.

Data Analytics with MATLAB

Tackling the Challenges of Big Data

Adrienne James, PhD

MathWorks

7th October 2014


Big Data in Industry

ENERGY: Asset Optimization

FINANCE: Market Risk, Regulatory

AUTO: Fleet Data Analysis

AERO: Maintenance, Reliability

MEDICAL DEVICES: Patient Outcomes


PROCESSING OPTIONS

• MATLAB RESTful interface to Cluster

• MATLAB Hadoop Streaming

• NoSQL connector (e.g. mongo)

• MATLAB / Java App accessing Cluster

• MATLAB Map-Reduce Components


Key takeaways

New functions for analysing data that does not fit in memory on your desktop

– datastore

– mapreduce

& that can scale for use with Hadoop

Additional techniques for predictive modelling with large data

– Work with large data in memory on a cluster (spmd)

Deploy predictive models

– Bring MATLAB analytics to the Web

– Share analytics with a wider community of users


How big is big? What characterises “big” data?

Wikipedia: “Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”

Volume: amount of data

Velocity: speed at which data is generated or needs to be analysed

Variety: range of data types and data sources


Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

3. What hardware do you have access to?

4. Analysis Characteristics?


Example: Airline Delay Analysis

Data

– BTS/RITA Airline On-Time Statistics

– 123.5M records, 29 fields

Analysis Tasks

– Calculate delay patterns

– Visualize summaries

– Estimate & evaluate predictive models


Considerations: Large Data Analytics

Airline Data Characteristics

1. Size & type of data?

CSV Data

22 files

12GB


Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is my data?

• Small subset available locally

• Entire data set stored elsewhere


Big Data Analysis with MATLAB – start on the desktop

[Workflow diagram: Access, Explore, Prototype, Scale, Share/Deploy]

Work on your desktop

Start “simple”

Basic statistics

Explore data


Demo: Exploring departure delays using datastore

Explore approaches pre- & post-datastore

Start with a small subset …

What happens as the data size grows?

… until eventually it does not fit in memory on your desktop machine


Access & explore bigger data on the desktop more easily

Easily specify data set

– Single text file (or collection of text files)

– Database (using Database Toolbox)

Preview data structure and format

Customise data to import using column names

Incrementally read subsets of the data

airdata = datastore('*.csv');

airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};

data = read(airdata);

datastore
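The incremental-read pattern described on this slide can be sketched as follows. This is a minimal illustration, not the demo from the talk: the file `airline_sample.csv` and its four rows are made up so that the example is self-contained, and `ReadSize` is the property name used in recent MATLAB releases (check the documentation for your release, since older versions may name it differently).

```matlab
% Hypothetical stand-in for the airline CSV files: write a tiny table to disk.
csvFile = fullfile(tempdir, 'airline_sample.csv');
T = table([100; 250; 400; 800], [5; -3; 20; 12], ...
    'VariableNames', {'Distance', 'ArrDelay'});
writetable(T, csvFile);

% Point a datastore at the file and import only the column of interest.
airdata = datastore(csvFile);
airdata.SelectedVariableNames = {'ArrDelay'};
airdata.ReadSize = 2;              % rows returned by each call to read

% Accumulate running totals so the full data set never has to be in memory.
totalDelay = 0;
numRows = 0;
while hasdata(airdata)
    chunk = read(airdata);         % table holding the next block of rows
    totalDelay = totalDelay + sum(chunk.ArrDelay);
    numRows = numRows + height(chunk);
end
meanDelay = totalDelay / numRows;
fprintf('Mean arrival delay over %d rows: %.2f\n', numRows, meanDelay);
```

The same loop works unchanged when the datastore points at a collection of large files, because only one block of rows is held in memory at a time.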


datastore extends Data Access Landscape

[Diagram: data access functions by data type, ordered from SMALL to increasing data size, shown pre- and post-datastore]

Text files: Import Tool, readtable, textscan; textscan and readtable + programming; datastore (post-datastore)

Databases: database, database.ODBCConnection (API)

.MAT files: load, matfile

Binary files: fread, …; memmapfile

Images: imread, …; ImageAdapter

Streaming data: System objects


Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

• Small subset available locally

• Entire data set stored elsewhere

3. What hardware do you have access to?

4. Analysis characteristics: initially, simple statistics & data exploration


Big Data Analysis with MATLAB

[Workflow diagram: Access, Explore, Prototype, Scale, Share/Deploy]

Scale to a cluster

Start locally and then …


What is Hadoop?

A Big Data Platform

[Diagram: a datastore over HDFS; the data is spread across several nodes, and each node runs Map and Reduce tasks]


A bit of audience participation – mapreduce ….


Introducing the mapreduce programming framework

Example: National popularity contest

Input files: newspaper pages. For each page, how many times do “Steve”, “Emily” and “David” get mentioned?

Intermediate files (local disk)

Output files: total mentions (Steve 11%, Emily 58%, David 31%)


mapreduce concept – group counts

Input files → Map → Intermediate files (local disk) → Reduce → Output files
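The group-count idea can be sketched with `mapreduce` on a small, made-up data set. This is an illustrative sketch rather than the code used in the talk: the file `carrier_sample.csv`, its contents, and the function names `countMapper` and `countReducer` are all hypothetical. Local functions at the end of a script need R2016b or later; in earlier releases each function would go in its own file. `mapreducer(0)` requests serial execution in the local MATLAB session.

```matlab
% Hypothetical input: one row per flight, keyed by carrier code.
csvFile = fullfile(tempdir, 'carrier_sample.csv');
T = table({'AA'; 'UA'; 'AA'; 'DL'; 'AA'; 'UA'}, [5; -3; 20; 12; 0; 7], ...
    'VariableNames', {'UniqueCarrier', 'ArrDelay'});
writetable(T, csvFile);

ds = datastore(csvFile);
ds.SelectedVariableNames = {'UniqueCarrier'};

% Run map and reduce serially in this session and keep the output in tempdir.
result = mapreduce(ds, @countMapper, @countReducer, mapreducer(0), ...
    'OutputFolder', tempdir, 'Display', 'off');
counts = readall(result);          % table with Key and Value columns
disp(counts);

function countMapper(data, ~, intermKVStore)
    % Map: count the flights per carrier within one block of rows.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    blockCounts = accumarray(idx, 1);
    addmulti(intermKVStore, carriers, num2cell(blockCounts));
end

function countReducer(key, valIter, outKVStore)
    % Reduce: add up the per-block counts collected for one carrier.
    total = 0;
    while hasnext(valIter)
        total = total + getnext(valIter);
    end
    add(outKVStore, key, total);
end
```

The mapper only ever sees one block of the datastore at a time, which is what lets the same pair of functions run on data that does not fit in memory.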


Demo: Exploring mapreduce


Explore and Analyze Data on Hadoop

[Diagram: MATLAB MapReduce code, MATLAB Distributed Computing Server, and a Hadoop cluster; a datastore points at data in HDFS, and each node holds data and runs Map and Reduce tasks]

ds = datastore('hdfs://myserver:7867/data/file1.txt');


Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

3. What hardware do you have access to? Cluster

4. Analysis characteristics: explore predictive modelling


Big Data Analysis with MATLAB

[Workflow diagram: Access, Explore, Prototype, Scale, Share/Deploy]

Scale to a cluster

Options for more involved algorithms …

• may require all data in memory

• multiple iterations …


Data Analytics Landscape

[Diagram: techniques placed by data size (SMALL to increasing) and algorithm complexity (SIMPLE to COMPLEX); more programming effort is required as complexity grows]

Algorithm characteristics: easily partitioned, independent tasks; iterative; all data needed in memory at once

Techniques shown: built-in numerical & statistical algorithms, vectorisation, parfor, gpuArray, mapreduce, spmd, distributed arrays


Working with more “complex” algorithms with data in memory on a cluster

[Diagram: Client and MDCS; data labelled 1987, 1988, 1989, 1990, 1991, 1992; arrows labelled Instructions and Reduced Data]


Demo: Predictive Modelling

Logistic Regression & Neural Networks

10 busiest airport origins & 7 largest airline carriers

Explore & compare prediction quality of two models to predict flights delayed for more than 20 minutes

– Randomly partition data into test and training sets (cvpartition)

– Model #1: Logistic Regression

– Model #2: Neural Network

Predictor Variables: DayOfWeek, Origin, Airline, DepTime, Distance
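The partition-then-fit workflow for Model #1 might look like the sketch below. It is not the demo code: the data is synthetic, only two of the predictors (hypothetical `depTime` and `distance` columns) are used, the rule generating delays is invented, and `fitglm` with a binomial distribution is one common way to fit a logistic regression with Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                   % reproducible synthetic data
n = 1000;
depTime  = 24 * rand(n, 1);               % hypothetical departure hour, 0 to 24
distance = 200 + 2000 * rand(n, 1);       % hypothetical flight distance
% Invented rule: later departures are more likely to be delayed.
pDelay  = 1 ./ (1 + exp(-(0.25 * depTime - 3)));
delayed = double(rand(n, 1) < pDelay);    % 1 = delayed, 0 = on time

X = [depTime, distance];

% Randomly hold out 30% of the rows for testing.
c = cvpartition(n, 'HoldOut', 0.3);
trainIdx = training(c);
testIdx  = test(c);

% Logistic regression: a generalized linear model with a binomial response.
mdl = fitglm(X(trainIdx, :), delayed(trainIdx), 'Distribution', 'binomial');

% Score the held-out rows and compare with the known outcomes.
pHat = predict(mdl, X(testIdx, :));       % predicted probability of delay
predicted = pHat > 0.5;
accuracy = mean(predicted == delayed(testIdx));
fprintf('Hold-out accuracy: %.3f\n', accuracy);
```

Evaluating a second model on the same `testIdx` rows keeps the comparison between models fair, since both are scored on identical held-out data.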


Single Program, Multiple Data

Lab 1: >> mycode

Lab 2: >> mycode

Lab 3: >> mycode

Lab 4: >> mycode


Single Program, Multiple Data

[Diagram: the Client and a Parallel Pool of Lab 1 to Lab 4 on the Cluster]

spmd

a = rand;

end

Each of Lab 1, Lab 2, Lab 3 and Lab 4 executes: a = rand;
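A slightly fuller sketch of the same pattern shows how per-worker results come back to the client. It assumes Parallel Computing Toolbox is installed and that a pool can be opened with the default profile; the variable names are illustrative only.

```matlab
% Open a parallel pool only if one is not already running.
if isempty(gcp('nocreate'))
    parpool;
end

spmd
    % Every worker runs this block on its own, independently generated data.
    localSum = sum(rand(1000, 1));
end

% On the client, localSum is a Composite with one element per worker.
total = 0;
for k = 1:length(localSum)
    total = total + localSum{k};
end
fprintf('Sum over %d workers: %.2f\n', length(localSum), total);
```

Only the small per-worker results travel back to the client; the large arrays stay in the workers' memory, which is the point of working with data in memory on a cluster.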


Explore Big Data

[Workflow diagram: Access, Explore, Prototype, Scale, Share/Deploy]

Subset data by filtering or variable selection and gain insight with visualization


Highlights: Airline Delay Analysis

Start small

Scale up

Quick prototyping on large data

Interactive exploration

Interspersed visualizations

Predictive modelling with large data


Deploy

[Workflow diagram: Access, Explore, Prototype, Scale, Share/Deploy]

Deployment targets: Desktop, Web, Enterprise, Hadoop


Web Analytics: Analysis of traffic around Paris

http://rumeur.bruitparif.fr/


Predictive Data Analytics – Load Demand Forecasting


Demo

Station:


MATLAB on Hadoop

Two modes of operation

Execute mapreduce on Hadoop from your MATLAB desktop using MATLAB Distributed Computing Server

– Extends your desktop environment for use with Hadoop

– Execute algorithms within Hadoop MapReduce on data stored in HDFS

Create standalone applications or libraries for deploying to production instances of Hadoop

– Locked down package for use in production environments

– Integration of MATLAB analytics with operational systems


Key takeaways

New functions for analysing data that does not fit in memory on your desktop

– datastore

– mapreduce

& that can scale for use with Hadoop

Additional techniques for predictive modelling with large data

– Work with large data in memory on a cluster (spmd)

Deploy predictive models

– Bring MATLAB analytics to the Web

– Share analytics with a wider community of users


New Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Cloud Computing (MDCS on EC2)

Hadoop

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce


Additional Resources

MathWorks Web Site

Big Data with MATLAB: http://www.mathworks.com/discovery/big-data-matlab.html

MapReduce & Hadoop: http://www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

Machine Learning with MATLAB: http://www.mathworks.com/machine-learning/index.html

A selection of user stories

LiquidNet: Lean Data Analysis: The Awesome Data Dexterity of MATLAB Desktop

Ruukki Metals: Steel Manufacturing Process Analytics

CEESAR: Data Processing Framework Supporting Large Scale Driving Data Analysis

Daimler AG: Analyzing Test Data from a Worldwide Fleet of Fuel Cell Vehicles


Thank You