Out of Memory? No Problem. Developing Machine Learning Models on Big Data
Heather Gorr, PhD
MATLAB Product Marketing Manager
Big data without big changes
One file, one hundred files
The big data landscape can seem overwhelming
Building machine learning models with big data
Access, Preprocessing, and Exploration
Model Development
Model Validation and Scaling Up
Case study: Predict Air Quality in North America
Building machine learning models with big data – step by step
Access, Preprocessing, and Exploration
Model Development
Model Validation and Scaling Up
Historical files are on HDFS and real time data are available through an API
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2
You have 1 TB of data you’ve never seen before. Where do you start?
Use a Spark-enabled Hadoop cluster and MATLAB. Both are well known for machine learning.
HDFS
YARN
Spark
MATLAB
Access and preview the data with datastore
There are numerous datastores to access data in many forms
▪ Databases
▪ Images
▪ MDF Files
▪ Custom
▪ Simulink
Preview the data and adjust properties to best represent the data of interest
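A minimal sketch of the access-and-preview step follows. It writes a tiny CSV file to a temporary location so that it runs anywhere; the variable names and values are made up for illustration, and in the setting described here the location would be an HDFS path instead.

```matlab
% Hypothetical stand-in data, written to a temporary CSV file so the
% sketch is self-contained. Real data would live on HDFS.
T = table([20.1; 18.4; 22.7], [1012; 1009; 1015], [0.031; 0.040; 0.029], ...
    'VariableNames', {'Temperature', 'Pressure', 'Ozone'});
csvFile = [tempname '.csv'];
writetable(T, csvFile);

% Create the datastore. No data is loaded into memory at this point.
% 'TreatAsMissing' marks the text 'NA' as a missing value (assumed convention).
ds = datastore(csvFile, 'TreatAsMissing', 'NA');

% Adjust properties so later reads return only the variables of interest.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};

% Look at the first few rows without reading the whole data set.
p = preview(ds);
disp(p)
```

Pointing the same call at a folder or a wildcard pattern is the usual way to cover many files with one datastore.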
Use tall arrays to work with the data like any MATLAB array
Create a tall array for each datastore (for example, ozone)
Use familiar MATLAB functions on tall arrays
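As an illustration, the sketch below creates a tall table from a datastore and then uses ordinary indexing and arithmetic on it. The file contents, the variable names, and the 25-degree threshold are assumptions chosen for the example.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Hypothetical stand-in data written to a temporary CSV file.
T = table([20.1; 27.3; 22.7; 30.2], [0.031; 0.052; 0.029; 0.061], ...
    'VariableNames', {'Temperature', 'Ozone'});
csvFile = [tempname '.csv'];
writetable(T, csvFile);

% A tall table backed by the datastore.
ds = datastore(csvFile);
tt = tall(ds);

% Familiar syntax: add a variable, filter rows, compute a statistic.
tt.TemperatureF = tt.Temperature * 9/5 + 32;
hotDays = tt(tt.Temperature > 25, :);
meanOzoneHot = mean(hotDays.Ozone);

% Nothing has been read yet; gather triggers the computation.
meanOzoneHot = gather(meanOzoneHot);
disp(meanOzoneHot)
```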
Clean messy data using common preprocessing functions
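A short sketch of what such cleaning might look like on a tall table. It assumes, purely for illustration, that the sensors report -999 as a sentinel for missing readings; the actual data set may use a different convention.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);

% Hypothetical messy readings: -999 is an assumed sentinel value.
raw = table([20.1; -999; 18.4; NaN], [0.031; 0.040; -999; 0.029], ...
    'VariableNames', {'Temperature', 'Ozone'});
tt = tall(raw);

% Convert sentinel values to standard missing values (NaN).
tt = standardizeMissing(tt, -999);

% Drop every row that contains a missing value.
tt = rmmissing(tt);

% Bring the (small) cleaned result into memory.
clean = gather(tt);
disp(clean)
```

Dropping rows is only one option; replacing missing values (for example with `fillmissing`) may suit time series better.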
Execution model makes operations more efficient on big data
▪ Deferred evaluation
– Commands are not executed right away
– Operations are added to a queue
▪ Execution triggers include:
– gather function
– summary function
– Machine learning models
– Plotting
▪ Unnecessary results are not computed
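The deferred execution model can be sketched with synthetic data as follows. The data and variable names are illustrative only.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);
rng(0);

% Synthetic stand-in for a large column of measurements.
x = tall(randn(10000, 1));

% These lines only queue operations; no data is processed yet.
m = mean(x);
s = std(x);
unused = x .^ 2;   % never gathered, so it is never evaluated

% m is still an unevaluated tall value at this point.
disp(istall(m))

% One gather call evaluates both queued results together.
[m, s] = gather(m, s);
fprintf('mean = %.3f, std = %.3f\n', m, s);
```

Requesting several results in a single `gather` call lets the passes over the data be shared rather than repeated per result.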
Explore the data with tall visualizations
plot
scatter
binscatter
histogram
histogram2
ksdensity
Get a summary of the data
Gather a subset of the data
datasample: from 1980 to 2017
head: first 10000
tail: last 10000
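The three ways of pulling a subset into memory might look like the sketch below. It uses a synthetic tall table; the variable names and the random sample size of 5000 are assumptions, not values from the case study.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);
rng(0);

% Synthetic stand-in for a large table of measurements.
n = 50000;
tt = tall(table((1:n)', randn(n, 1), 'VariableNames', {'Id', 'Ozone'}));

% First and last 10000 rows.
firstRows = gather(head(tt, 10000));
lastRows  = gather(tail(tt, 10000));

% Random sample drawn from across the whole range, without replacement.
randomRows = gather(datasample(tt, 5000, 'Replace', false));

fprintf('%d first, %d last, %d random rows in memory\n', ...
    height(firstRows), height(lastRows), height(randomRows));
```

After `gather`, each result is an ordinary in-memory table.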
Explore the subset of data in MATLAB as you always do
Use the results of explorations to help make decisions
Synchronize all data to daily times
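A minimal sketch of daily synchronization, shown on small in-memory timetables with made-up values. Using the daily mean as the aggregation method is an assumption, and whether the identical call applies to tall timetables depends on the MATLAB release.

```matlab
rng(0);

% Hypothetical hourly weather readings over two days.
tWeather = datetime(2017, 1, 1) + hours(0:47)';
weather = timetable(tWeather, 15 + 5 * rand(48, 1), ...
    'VariableNames', {'Temperature'});

% Hypothetical pollutant readings every two hours, offset by 30 minutes.
tPollutant = datetime(2017, 1, 1, 0, 30, 0) + hours(0:2:46)';
pollutants = timetable(tPollutant, 0.03 + 0.01 * rand(24, 1), ...
    'VariableNames', {'Ozone'});

% Put both sources on a common daily time vector, averaging within each day.
daily = synchronize(weather, pollutants, 'daily', 'mean');
disp(daily)
```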
You don’t need to leave MATLAB to monitor large jobs
Building machine learning models with big data
Access, Preprocessing, and Exploration
Model Development
Model Validation and Scaling Up
How do you know which model to use?
Try them all ☺
Predict air quality
Air Quality Index: Regression
Air Quality Label: Classification
Use apps for easy model exploration
Validate and compare models
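The same comparison can be done in code on an in-memory subset. The sketch below uses 5-fold cross-validation to compare two regression models; the synthetic data and the choice of a regression tree and a linear model are assumptions for illustration, not the models used in the case study.

```matlab
rng(0);

% Synthetic predictors and response standing in for a gathered subset.
n = 500;
X = rand(n, 3);
y = 50 + 30 * X(:, 1) - 20 * X(:, 2) + 5 * randn(n, 1);

% Train each candidate with 5-fold cross-validation.
cvTree   = fitrtree(X, y, 'KFold', 5);
cvLinear = fitrlinear(X, y, 'KFold', 5);

% Cross-validated mean squared error; lower is better.
mseTree   = kfoldLoss(cvTree);
mseLinear = kfoldLoss(cvLinear);
fprintf('Tree MSE: %.2f, linear MSE: %.2f\n', mseTree, mseLinear);
```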
Select the most important features
Building machine learning models with big data
Access, Preprocessing, and Exploration
Model Development
Model Validation and Scaling Up
Scale up with tall machine learning models
▪ Linear Regression (fitlm)
▪ Logistic & Generalized Linear Regression (fitglm)
▪ Discriminant Analysis Classification (fitcdiscr)
▪ K-means Clustering (kmeans)
▪ Principal Component Analysis (pca)
▪ Partition for Cross Validation (cvpartition)
▪ Linear Support Vector Machine (SVM) Classification (fitclinear)
▪ Naïve Bayes Classification (fitcnb)
▪ Random Forest Ensemble Classification (TreeBagger)
▪ Lasso Linear Regression (lasso)
▪ Linear Support Vector Machine (SVM) Regression (fitrlinear)
▪ Single Classification Decision Tree (fitctree)
▪ Linear SVM Classification with Random Kernel Expansion (fitckernel)
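As one example from this list, the sketch below fits a linear regression with `fitlm` on a tall table. The data are synthetic, and the relationship between the variables is invented so that the fitted coefficients can be checked by eye.

```matlab
% Run tall computations in the local MATLAB session (no parallel pool).
mapreducer(0);
rng(0);

% Synthetic data with a known, made-up linear relationship.
n = 20000;
Temperature = 10 + 20 * rand(n, 1);
WindSpeed   = 10 * rand(n, 1);
AQI         = 20 + 2 * Temperature - 1.5 * WindSpeed + randn(n, 1);
tt = tall(table(Temperature, WindSpeed, AQI));

% Fitting a model is an execution trigger, so the data is read here.
mdl = fitlm(tt, 'AQI ~ Temperature + WindSpeed');
disp(mdl.Coefficients)

% Goodness of fit is available on the returned model.
r2 = mdl.Rsquared.Ordinary;

% Predict for new in-memory observations (hypothetical values).
newData = table([15; 25], [2; 8], 'VariableNames', {'Temperature', 'WindSpeed'});
predicted = predict(mdl, newData);
disp(predicted)
```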
Big data machine learning models also include goodness-of-fit measures and convenient functions to explore and validate the model
Scale up. But not all at once
Use tall arrays in code
Apply model to subset of data
Apply model to all data
Apply model to new data
Deploy/Compile
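One way to read these steps in code: keep the tall-array code unchanged and switch only the execution environment. The sketch below runs locally by default; the cluster branch uses a placeholder install path and is an assumption about how a particular site is configured.

```matlab
% Flip to true only when a Spark-enabled Hadoop cluster is available.
useCluster = false;

if useCluster
    % Placeholder path; replace with the real Hadoop install location.
    setenv('HADOOP_HOME', '/path/to/hadoop');
    cluster = parallel.cluster.Hadoop;
    mapreducer(cluster);   % tall computations now run on the cluster
else
    mapreducer(0);         % tall computations run in the local session
end

% The analysis code below is identical in both environments.
rng(0);
x = tall(rand(1000, 1));
total = gather(sum(x));
disp(total)
```

Starting locally on a subset keeps iteration fast; only the environment selection changes when moving to the full data set.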
Big data without big changes
One file, one hundred files
44