
Out of Memory? No Problem. Developing Machine Learning Models on Big Data

Heather Gorr, PhD

MATLAB Product Marketing Manager

Big data without big changes

One file One hundred files

The big data landscape can seem overwhelming

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Case study: Predict Air Quality in North America

Building machine learning models with big data – step by step

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Historical files are on HDFS and real time data are available through an API

• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2

You have 1TB of data you’ve never seen before. Where do you start?

Use a Spark-enabled Hadoop cluster and MATLAB. Both are well known for machine learning.

HDFS

YARN

Spark

MATLAB

Access and preview the data with datastore

There are numerous datastores to access data in many forms

• Databases
• Images
• MDF Files
• Custom
• Simulink

Access air quality data using datastore

Preview the data and adjust properties to best represent the data of interest
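
A minimal sketch of this step, assuming the air quality files are delimited text. The folder, file names, variable names, and the 'NA' missing-value marker below are hypothetical stand-ins; the slides do not show the actual files. Two tiny CSV files are written to a temporary folder so the example runs without a cluster.

```matlab
% Hypothetical sample files standing in for the historical data on HDFS
folder = fullfile(tempdir, 'aq_datastore_demo');
if ~exist(folder, 'dir'), mkdir(folder); end
T1 = table([1980; 1981], [31; 28], [15.2; 16.1], ...
    'VariableNames', {'Year', 'Ozone', 'Temperature'});
T2 = table([1982; 1983], [35; 40], [14.8; 17.0], ...
    'VariableNames', {'Year', 'Ozone', 'Temperature'});
writetable(T1, fullfile(folder, 'part1.csv'));
writetable(T2, fullfile(folder, 'part2.csv'));

% Point a datastore at every CSV file in the folder (nothing is loaded yet)
ds = tabularTextDatastore(fullfile(folder, '*.csv'));

% Adjust properties so that reads return only the data of interest
ds.SelectedVariableNames = {'Year', 'Ozone'};  % skip unneeded columns
ds.TreatAsMissing = 'NA';                      % assumed missing-value marker

% Look at the first few rows without reading the whole data set
p = preview(ds);
disp(p)
```

For data that lives on a cluster, the same pattern would apply with the folder replaced by the cluster location; that location is site-specific and not given in the slides.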

Use tall arrays to work with the data like any MATLAB array

Create a tall array for each datastore

ozone

Use familiar MATLAB functions on tall arrays

Clean messy data using common preprocessing functions
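
As a small illustration of cleaning a tall table, the sketch below uses an in-memory table converted with tall so that it runs anywhere. The variable names and the -999 sentinel value are assumptions made for the example, not details from the presentation.

```matlab
mapreducer(0);  % evaluate tall arrays in the local MATLAB session

% Hypothetical messy data: -999 and NaN both mark missing readings
T = table((1980:1985)', [31; -999; 35; NaN; 29; 33], ...
    'VariableNames', {'Year', 'Ozone'});
tt = tall(T);

tt = standardizeMissing(tt, -999);  % convert the sentinel to a standard missing value
tt = rmmissing(tt);                 % drop rows that contain any missing value

clean = gather(tt);                 % bring the small result into memory
disp(clean)
```

The same two function calls are what one would use on an ordinary in-memory table, which is the point of the slide.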

Execution model makes operations more efficient on big data

▪ Deferred evaluation
  – Commands are not executed right away
  – Operations are added to a queue

▪ Execution triggers include:
  – gather function
  – summary function
  – Machine learning models
  – Plotting

Execution model makes operations more efficient on big data

Unnecessary results are not computed
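
The deferred execution model can be seen on a very small tall array. This is an illustrative sketch with made-up numbers; the variable names are not from the presentation.

```matlab
mapreducer(0);  % evaluate tall arrays in the local MATLAB session

x = tall([31; 28; 35; 40]);

% These lines only queue operations; no data is processed yet
y = 2*x + 1;
unused = sqrt(x);            % never gathered, so it is never computed

isDeferred = isa(y, 'tall'); % y is still an unevaluated tall array

% gather triggers execution; requesting both results in one call
% lets MATLAB compute them together instead of in separate passes
[mx, mn] = gather(max(y), min(y));
fprintf('max = %g, min = %g\n', mx, mn);
```

Gathering several results in a single call is a common way to reduce the number of passes through a large data set.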

Explore the data with tall visualizations

plot

scatter

binscatter

histogram

histogram2

ksdensity

Get a summary of the data

Gather a subset of the data

datasample: from 1980 to 2017

head: first 10000

tail: last 10000
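
A minimal sketch of pulling the first and last rows of a tall table into memory, using a small made-up table (one row per year, 1980 to 2017) and 5 rows instead of 10000. Random sampling with datasample, which the slide also names, is left out here because it requires Statistics and Machine Learning Toolbox.

```matlab
mapreducer(0);  % evaluate tall arrays in the local MATLAB session

% Hypothetical data: one row per year from 1980 to 2017
T = table((1980:2017)', (1:38)', 'VariableNames', {'Year', 'Ozone'});
tt = tall(T);

% head and tail are deferred like other tall operations;
% gather brings the small subsets into memory
first5 = gather(head(tt, 5));
last5  = gather(tail(tt, 5));

disp(first5)
disp(last5)
```

Once gathered, the subsets are ordinary tables and can be explored with any MATLAB function.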

Explore the subset of data in MATLAB as you always do

Use the results of explorations to help make decisions

Synchronize all data to daily times
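
One way to sketch this step is with the synchronize function on timetables. The example below uses small in-memory timetables with made-up sampling rates and values, and it assumes a daily mean is the desired aggregation; the slides do not say which method was used.

```matlab
% Hypothetical ozone readings every 6 hours over two days
tOzone = datetime(2017, 1, 1) + hours(0:6:42)';
ozone = timetable(tOzone, (1:8)', 'VariableNames', {'Ozone'});

% Hypothetical temperature readings every 12 hours over the same two days
tTemp = datetime(2017, 1, 1) + hours(0:12:36)';
temp = timetable(tTemp, [10; 14; 12; 16], 'VariableNames', {'Temperature'});

% Align both timetables on a common daily time vector,
% averaging all readings that fall within each day
daily = synchronize(ozone, temp, 'daily', 'mean');
disp(daily)
```

After this step every variable shares one row per day, which is the shape most model-fitting functions expect.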

Save the preprocessed data so you do not have to repeat these steps each time
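
A minimal sketch of saving and reloading a tall table with the write function. The output folder is a hypothetical temporary location and the table is made up for the example.

```matlab
mapreducer(0);  % evaluate tall arrays in the local MATLAB session

% Hypothetical preprocessed data
T = table((1980:1985)', [31; 28; 35; 40; 29; 33], ...
    'VariableNames', {'Year', 'Ozone'});
tt = tall(T);

% write needs an empty or nonexistent folder, so clear any earlier run
loc = fullfile(tempdir, 'aq_preprocessed_demo');
if exist(loc, 'dir'), rmdir(loc, 's'); end

% Evaluate the queued operations and save the result to disk
write(loc, tt);

% Later: reload the saved data as a new tall table
tt2 = tall(datastore(loc));
nRows = gather(height(tt2));
fprintf('Reloaded %d rows\n', nRows);
```

Reloading from the saved folder skips every preprocessing step that was queued before the write call.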

You don’t need to leave MATLAB to monitor large jobs

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

How do you know which model to use?

Try them all ☺

Predict air quality

Air Quality Index: Regression

Air Quality Label: Classification
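
To make the two targets concrete, the sketch below turns a numeric Air Quality Index (a regression target) into a categorical label (a classification target). It assumes the standard US EPA AQI category breakpoints; the slides do not state which labeling scheme was used, and the sample values are made up.

```matlab
% Hypothetical Air Quality Index values (the regression target)
aqi = [42; 75; 160];

% Assumed US EPA breakpoints: 0-50, 51-100, 101-150, 151-200, 201-300, 301+
edges = [0 51 101 151 201 301 Inf];
names = {'Good', 'Moderate', 'Unhealthy for Sensitive Groups', ...
         'Unhealthy', 'Very Unhealthy', 'Hazardous'};

% Bin each index value into its category (the classification target)
label = discretize(aqi, edges, 'categorical', names);
disp(table(aqi, label))
```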

Use apps for easy model exploration

Validate and compare models

Select the most important features

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Scale up with tall machine learning models

▪ Linear Regression (fitlm)

▪ Logistic & Generalized Linear Regression (fitglm)

▪ Discriminant Analysis Classification (fitcdiscr)

▪ K-means Clustering (kmeans)

▪ Principal Component Analysis (pca)

▪ Partition for Cross Validation (cvpartition)

▪ Linear Support Vector Machine (SVM) Classification (fitclinear)

▪ Naïve Bayes Classification (fitcnb)

▪ Random Forest Ensemble Classification (TreeBagger)

▪ Lasso Linear Regression (lasso)

▪ Linear Support Vector Machine (SVM) Regression (fitrlinear)

▪ Single Classification Decision Tree (fitctree)

▪ Linear SVM Classification with Random Kernel Expansion (fitckernel)

Big data machine learning models also include goodness-of-fit measures and convenient functions to explore and validate the model
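
As an illustration, the sketch below fits a linear regression with fitlm on a tall table, reads a goodness-of-fit measure, and predicts on new data. It requires Statistics and Machine Learning Toolbox. The data are synthetic (ozone as a nearly linear function of temperature) and the variable names are assumptions, not the case study's actual model.

```matlab
mapreducer(0);  % evaluate tall arrays in the local MATLAB session

% Synthetic data: Ozone = 1 + 2*Temperature plus a small alternating error
k = (1:20)';
Temperature = k;
Ozone = 1 + 2*Temperature + 0.1*(-1).^k;
tt = tall(table(Temperature, Ozone));

% fitlm accepts a tall table and handles the passes through the data itself
mdl = fitlm(tt, 'Ozone ~ Temperature');

coeffs = mdl.Coefficients.Estimate;  % [intercept; slope]
r2 = mdl.Rsquared.Ordinary;          % goodness-of-fit measure

% Apply the fitted model to new in-memory data
newData = table(25, 'VariableNames', {'Temperature'});
pred = predict(mdl, newData);
fprintf('slope = %.3f, R^2 = %.4f, prediction = %.2f\n', coeffs(2), r2, pred);
```

The call is the same one used for in-memory data; only the input type changes when the data grow.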

Scale up. But not all at once

Use tall arrays in code

Apply model to subset of data

Apply model to all data

Apply model to new data

Deploy/Compile

Big data without big changes

One file One hundred files
