H2O for non-data scientists Meetup 07.22.15

H2O for Non-Data Scientists

An Introduction to Data Science with H2O with Jessica Lanford and Amy Wang

I’m no data scientist…(tech writer by trade)

…but I can use H2O (and so can you!)

What is H2O?

• “Scalable machine learning for smarter applications”

• Let’s break that down…

• Scalable = as large (or small) as you like

• Machine learning = let the computers do the math (so you don’t have to)

• Smarter applications = learn from your data

The H2O Team

• Sales/Marketing • Engineering/Development

• Operations • Advisors

Why use data science?

• Learn from your data to improve your business

• Improve efficiency

• Identify, predict, & prevent supply chain gaps

• Lifehacks

• find better Bordeaux

• find better deals on shoes

• improve your chances in your office sports pool

• tonight we’ll learn how to simplify your flight with GLM!

First, let’s download our data:

1. Go to data.h2o.ai

2. Click the airlines/allyears2k.csv dataset and save the file.

This dataset represents a subset of flight arrival/departure info for all commercial flights within the US (Oct. 1987-Apr. 2008). For more accurate results, please try on your own with the larger dataset at https://s3.amazonaws.com/h2o-airlines-unpacked/airlines_all.05p.csv


Now, Let’s Download H2O

1) Go to h2o.ai/download.

2) Click the Simons link.

3) Click the Download H2O button at the top of the screen.

4) After the file downloads, open Terminal (OS X).

5) Type the following (OS X):

    cd ~/Downloads
    unzip h2o-3.0.1.2.zip
    cd h2o-3.0.1.2
    java -jar h2o.jar

6) In your browser, enter localhost:54321

You are now running H2O!

Windows Users

1) Go to Downloads folder

2) Unzip h2o-3.0.1.2.zip

3) Double-click the h2o executable jar file.

4) In browser, enter localhost:54321

Make sure you have Java installed first! https://java.com/en/download/


Flow Web UI: Assist Me

Flow Pro Tips:

• To run Flow strictly as a GUI, click Flow > Toggle All Cell Inputs

• View your saved flows by clicking the Flows tab in the Help sidebar

• Use the keyboard shortcuts (Help > Keyboard Shortcuts)

Importing Data

1. Click the importFiles link under Assistance.

2. In the Search field, start typing the location where you saved the airlines dataset. Use the exact file path (e.g., \Users\Home\Downloads\allyears2k.csv for Windows, /Users/<LoginName>/Downloads for OS X)

3. Click to select the airlines dataset, then press the Enter key.

4. Click the Add all link.

5. Click the Import button.

• Flow Pro Tip: Find any file available in H2O using Data > List All Frames

Parsing Data

• Why parse?

• Can customize data before creating model

• Confirm data transferred correctly

• Feeds in data in H2O-friendly format

1. Click the Parse these files… button.

2. Review the data in the Edit Column Names and Types section.

3. Click the Parse button.

Parse Setup

• Sources: Original file

• Destination key: H2O-friendly file name

• Parser: File type (CSV, XLS, XLSX, etc.)

• Separator: what separates the entries (comma, semicolon, space, etc.)

• Column headers: Is 1st row data or names?

Flow Pro Tip: Save actions and paste them in other sections of your workflow using the Clips feature (the paperclip icon)
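To make the Separator and Column headers settings more concrete, here is a small sketch using only Python's standard csv module. The sample rows are invented and the header check is a simple made-up heuristic; it illustrates what the parse step has to decide and is not how H2O's parser works internally.

```python
import csv
import io

# Hypothetical sample in the spirit of the airlines file (values invented).
SAMPLE = "Year;Month;Origin;Distance\n1987;10;SAN;447\n1987;10;SFO;1846\n"

def guess_separator(text, candidates=",;\t|"):
    """Guess which character separates the entries."""
    return csv.Sniffer().sniff(text, delimiters=candidates).delimiter

def is_number(cell):
    """True if the cell can be read as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def first_row_is_header(rows):
    """Heuristic: no numbers in row 1, at least one number in row 2."""
    return (not any(is_number(c) for c in rows[0])
            and any(is_number(c) for c in rows[1]))

sep = guess_separator(SAMPLE)
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter=sep))
has_header = first_row_is_header(rows)

# If the first row is data, fall back to generic names (C1, C2, ...).
names = rows[0] if has_header else ["C%d" % (i + 1) for i in range(len(rows[0]))]
data = rows[1:] if has_header else rows
```

If the guess is wrong for your file, that is exactly when you would override the Separator or Column headers setting by hand in Parse Setup.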

What type of model should I use?

• Classification (what category?)

• Binomial (2): Naive Bayes, GLM (logistic regression), DRF, Deep Learning - airlines: delay/no delay

• Multinomial (2+): Deep Learning, GBM, DRF - prediction; identify handwritten numbers

• Regression (what value?)- how much supply to meet demand?

• CoxPH

• GLM

• Dimensionality reduction (many —> few) - narrow # of features to find important ones

• DL Autoencoder

• PCA

• Clustering (what’s similar?) - tracking customer behavior; use when you don’t know what the response will be

• K-Means

Flow Pro Tip: Access each model type from the Model menu

Data contains values or categories?

• Values: GBM, GLM, DRF, DL

• Categories: How many categories?

• 2 only: NB, GBM, GLM, DRF, DL

• 2+: GBM, DRF, DL
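The chart above can also be read as a tiny lookup function. This is only a sketch: the function name and arguments are made up, and the assignment of each algorithm list to a branch is an interpretation of the flattened chart (the three-item list matches the multinomial bullet on the earlier slide).

```python
def candidate_algorithms(response_kind, n_categories=None):
    """Return the algorithm abbreviations the chart suggests.

    response_kind: "values" (predict a number) or "categories" (predict a label).
    n_categories: how many distinct categories; only used for "categories".
    """
    if response_kind == "values":
        return ["GBM", "GLM", "DRF", "DL"]
    if response_kind == "categories":
        if n_categories is None or n_categories < 2:
            raise ValueError("classification needs at least 2 categories")
        if n_categories == 2:
            return ["NB", "GBM", "GLM", "DRF", "DL"]
        return ["GBM", "DRF", "DL"]
    raise ValueError("response_kind must be 'values' or 'categories'")

# The airlines question (delayed: yes or no) has exactly 2 categories.
airlines_options = candidate_algorithms("categories", 2)
```

For tonight's example, GLM is one of the options returned for a two-category response, which is why we use it for IsDepDelayed.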

What is GLM?

With some problems, we don’t know what the answer is. But with GLM, you already know what answer you want - the model helps you figure out how to get there.

• The dependent variable is the response (the result we want). For our dataset, this is IsDepDelayed.

• The independent variables are the predictors: the factors that influence the results. For our dataset, these include Origin, Dest, Distance…

• The weights represent how much each independent variable (factor) influences the dependent variable (result).
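Here is a rough sketch of how weights turn predictors into a yes/no answer in a binomial GLM (logistic regression). The weights and the intercept below are invented for illustration; they are not results from the airlines data, and the helper name is hypothetical.

```python
import math

# Hypothetical weights, invented for illustration only.
INTERCEPT = -0.5
WEIGHTS = {"Distance": 0.0004, "Origin=SFO": 0.3, "Origin=SAN": -0.2}

def delay_probability(predictors):
    """Weighted sum of the predictors, squashed into a 0..1 probability."""
    score = INTERCEPT + sum(WEIGHTS[name] * value
                            for name, value in predictors.items())
    return 1.0 / (1.0 + math.exp(-score))

# One flight: 1000 miles, leaving from SFO (categories become 0/1 flags).
flight = {"Distance": 1000, "Origin=SFO": 1, "Origin=SAN": 0}
p = delay_probability(flight)
prediction = "delayed" if p >= 0.5 else "not delayed"
```

A positive weight pushes the probability of a delay up and a negative weight pushes it down. Building the model is the step where the computer works out these weights from the data.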

Building Models

1. Click the View button after parsing.

2. Click the Build Model button (below allyears2k.hex).

3. Select Generalized Linear Model from the drop-down list.

4. Select allyears2k.hex from the validation_frame drop-down list.

5. Select IsDepDelayed from the Response Column drop-down list.

6. Click the All button in the ignored_columns section, then uncheck the following:
- Year
- Month
- DayofWeek
- UniqueCarrier
- FlightNum
- Origin
- Dest
- Distance
- IsDepDelayed

7. Select binomial from the family drop-down list.

8. Click the Build Model button at the bottom of the page. After the model is complete, click View.
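If you would rather script these steps than click through Flow, the same choices can be sketched with the h2o Python package. This assumes a recent release of that package, whose API may differ from the 3.0.1.2 download used in these slides; the function name build_delay_glm and the file path are placeholders.

```python
PREDICTORS = ["Year", "Month", "DayofWeek", "UniqueCarrier",
              "FlightNum", "Origin", "Dest", "Distance"]
RESPONSE = "IsDepDelayed"

def build_delay_glm(csv_path):
    """Import, parse, and fit a binomial GLM, mirroring the Flow steps."""
    # Imported here so the sketch can be read without h2o installed.
    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    h2o.init()                                # start or connect to a local H2O
    flights = h2o.import_file(path=csv_path)  # import and parse in one call
    # Make sure the response is treated as a category, not a number.
    flights[RESPONSE] = flights[RESPONSE].asfactor()

    model = H2OGeneralizedLinearEstimator(family="binomial")
    # As in the slides, the training data doubles as the validation frame.
    model.train(x=PREDICTORS, y=RESPONSE,
                training_frame=flights, validation_frame=flights)
    return model

# Usage (needs Java, the h2o package, and the downloaded file):
# model = build_delay_glm("allyears2k.csv")
# print(model.coef())
```

Listing the predictors explicitly plays the same role as unchecking them in the ignored_columns section: every column not named here is left out of the model.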

Interpreting Results

If your parameters were added correctly, you should see results that look like this.

So what do these results tell us?

Airports at the top of the list are more likely to be delayed (if they have a positive coefficient) or less likely to be delayed (with a negative coefficient). Lynchburg is more likely to be delayed than Corpus Christi Intl.

Coefficient: relationship of feature to response variable; for airlines data, positive = more likely to be delayed, negative = less likely to be delayed

ROC curve = plot of the true positive rate against the false positive rate (here, about 50/50 due to smaller dataset - try with the larger dataset to see the difference!)
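To see where an ROC curve comes from, here is a small sketch in plain Python. The labels and scores are invented toy data, and the helper names are made up; H2O computes all of this for you.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scoring at or above it
    # is predicted "delayed".
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented toy data: 1 = delayed, 0 = not delayed.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))

# A model that gives every flight the same score is no better than a coin flip.
coin_flip_area = auc(roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

An area near 1 means the model ranks delayed flights above on-time ones; an area near 0.5 is the "about 50/50" case mentioned above.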

Getting Help

• To get help, click the question mark button (last button in row), or click Help > Assist Me

• docs.h2o.ai (Select H2O 3.0)

• FAQ - http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/26/docs-website/h2o-docs/index.html#FAQ

• Flow Web UI (also available in Flow in the “Help” tab in the sidebar)

• h2ostream - https://groups.google.com/forum/#!forum/h2ostream

• learn.h2o.ai

I’m still having trouble…What do I do?

• First, check the FAQ & h2ostream to see if it is already solved

• If not, get the logs:

• Admin > Download Logs

• The logs will help our support team identify the issue faster

• Then send the logs to [email protected] or post them on h2ostream (if you don’t mind sharing your data)

• Can also use Help > Report an Issue to file a JIRA ticket


Data Science Resources

• StackExchange - stats.stackexchange.com

• Kaggle - kaggle.com/wiki/Home (also see Tutorials on kaggle.com)

• Quora - quora.com/Machine-Learning

• StatTrek - stattrek.com

• Data sets - docs.h2o.ai/h2oclassic/resources/publicdata.html


• Regression: goal is to predict a numerical value for the response; output = continuous values (quantitative)

• Classification: goal is to predict a category for the response; output = labels (categorical)

• Binomial: only 2 possible outcomes (yes/no)

• Multinomial: 2+ outcomes; model is generated based on outcomes of the specified number of independent trials (outcome of previous trial does not affect next), which have a whole number of possible outcomes with a constant probability

• Response col: Dependent variable - what do you want to measure/observe/predict?

• Predictors: Independent variables - what factors can be manipulated/controlled?

• Validation: How well does the model fit?

• Confusion matrix: Table that depicts algorithm’s performance (usually supervised learning)

• Supervised vs unsupervised: in supervised learning, input is labeled so that algorithm can identify it and learn from it; in unsupervised learning, input data is not labeled

• Hit ratio: number of times prediction was correct out of total predictions

• Iteration: round of model testing/training

• Standardization: Converts variables to standardized variables; helps prevent precision problems
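A few of these terms are easier to see with numbers. The sketch below uses invented predictions and plain Python; the function names are made up for illustration.

```python
from collections import Counter
import statistics

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def hit_ratio(actual, predicted):
    """Correct predictions divided by total predictions."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Invented example: was the flight delayed, and what did the model say?
actual = ["YES", "YES", "NO", "NO", "YES"]
predicted = ["YES", "NO", "NO", "YES", "YES"]

matrix = confusion_matrix(actual, predicted)
ratio = hit_ratio(actual, predicted)
scaled = standardize([100, 200, 300])
```

Here the model was right 3 times out of 5, and the confusion matrix shows which kind of mistake it made each time it was wrong. After standardizing, a column measured in miles and a column measured in minutes end up on the same scale.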

Vocabulary

• Seed: Number used to start the random number generator; controls sampling and initialization so that results can be repeated

• Offset: Compensates for differences in units of observation (different populations, different geographic sizes) to make sure outcome is proportional

• Weights: A way to specify certain outcomes as more significant than others (for example, if you are trying to identify incidence of disease, one "yes" can be more meaningful than 50 "no" responses). More weight = more importance.

• Trees: Similar to a flowchart - each decision moves you further down the flowchart (tree). Remember the game "20 questions"? If your only choices are "animal", "vegetable", or "mineral", and the responses are "no" for both "animal" and "vegetable", the decision tree will tell you that the answer is "mineral."

• Distribution: proportion, not frequency (number of red candies in a pack not the same every time)

• Balance classes: similar to weights; helps to give more importance to specific responses

• Enum: Categorical variable

• Separator: what separates the entries in the dataset (usually a comma, semicolon, etc.)

• Integer vs real: Integer = whole number (can be negative); real = number that can have a fractional part

• Binary: See “binomial” (only 2 possibilities)
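As a quick illustration of the Seed entry, this sketch uses Python's standard random module (not H2O) to show that the same seed gives the same "random" sample every time. The function name and row counts are made up.

```python
import random

def sample_rows(n_rows, k, seed):
    """Pick k row numbers out of n_rows; the seed controls which ones."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)

first = sample_rows(1000, 5, seed=42)
second = sample_rows(1000, 5, seed=42)   # same seed: same rows again
other = sample_rows(1000, 5, seed=7)     # different seed: a different draw
```

This is why it helps to write down the seed when you want someone else to be able to reproduce your model.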
