MUNGING, MODELING, AND PIPELINES USING PYTHON
Hank Roark
COMMUNITY FEEDBACK
Pythonic Interface to H2O, R interface parity
Rapid learning and iteration
Leverage existing knowledge and skills
Interface cleanly with PyData ecosystem
More Environments, esp. PySpark
Python Pipelines to Production
EXAMPLE FROM THE IOT
Domain: Prognostics and Health Management
Machine: Turbofan Jet Engines
Data Set: A. Saxena and K. Goebel (2008). "Turbofan Engine Degradation Simulation Data Set", NASA Ames Prognostics Data Repository
Predict Remaining Useful Life from Partial Life Runs
Six operating modes, two failure modes, manufacturing variability
Training: 249 jet engines run to failure
Test: 248 jet engines
WHY THIS EXAMPLE?
GETTING READY FOR BRONTOBYTES
LOADING DATA
SUMMARY STATISTICS
FEATURE ENGINEERING
Calculate Total Cycles For Each Unit
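The per-unit total can be sketched in plain Python as a group-by-and-max over the cycle counter. This is only an illustration of the computation, not the H2O call used in the talk; the `(unit, cycle)` row layout and the sample values are assumptions.

```python
# Hypothetical rows: (unit id, cycle number). Values are made up for illustration.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]


def total_cycles_per_unit(rows):
    """Return {unit: highest cycle number seen for that unit}."""
    totals = {}
    for unit, cycle in rows:
        totals[unit] = max(totals.get(unit, 0), cycle)
    return totals


totals = total_cycles_per_unit(rows)
```

For engines run to failure, the highest cycle number of a unit is the length of its life.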
FEATURE ENGINEERING
Append To Original Frame
FEATURE ENGINEERING
Create New Feature of Cycles Remaining
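A minimal plain-Python sketch of these feature engineering steps: compute each unit's total cycles, attach it to every row, and subtract the current cycle to get the cycles remaining. The field names (`unit`, `cycle`, `total_cycles`, `cycles_remaining`) and the sample records are assumptions, and the talk does this on an H2O frame rather than on Python dictionaries.

```python
def add_cycles_remaining(records):
    """Return new records with total_cycles and cycles_remaining added."""
    # Step 1: total cycles per unit (the last cycle observed for that unit).
    totals = {}
    for rec in records:
        totals[rec["unit"]] = max(totals.get(rec["unit"], 0), rec["cycle"])
    # Steps 2 and 3: attach the total to each row, then derive the new feature.
    enriched = []
    for rec in records:
        total = totals[rec["unit"]]
        enriched.append(
            dict(rec, total_cycles=total, cycles_remaining=total - rec["cycle"])
        )
    return enriched


# Hypothetical records for two engines.
records = [
    {"unit": 1, "cycle": 1},
    {"unit": 1, "cycle": 2},
    {"unit": 1, "cycle": 3},
    {"unit": 2, "cycle": 1},
    {"unit": 2, "cycle": 2},
]
enriched = add_cycles_remaining(records)
```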
EXPLORATORY DATA ANALYSIS
Boolean Indexing
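Boolean indexing means building a true/false mask from a condition and keeping only the rows where the mask is true. The sketch below shows the idea with plain Python lists; the column names and readings are made up, and a frame library would express the same thing as an index expression on the frame.

```python
from itertools import compress

# Hypothetical columns, aligned row by row.
cycles_remaining = [2, 1, 0, 1, 0]
sensor_reading = [641.8, 642.1, 643.5, 641.9, 643.7]

# The mask is one boolean per row: True where the engine is at end of life.
mask = [remaining == 0 for remaining in cycles_remaining]

# Keep only the readings whose mask entry is True.
at_failure = list(compress(sensor_reading, mask))
```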
EXPLORATORY DATA ANALYSIS
Sample the data to local memory
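The point of sampling is that plotting needs only a small, representative subset on the client, while the full data stays where it is. A rough sketch of drawing a fixed-fraction random sample, assuming the rows fit in a Python list (the function name, the fraction, and the seed are illustrative choices, not from the talk):

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of the rows."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


rows = list(range(100))  # stand-in for 100 data rows
local_sample = sample_rows(rows, fraction=0.1)
```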
EXPLORATORY DATA ANALYSIS
Use your favorite visualization tools (Seaborn!)
Ugh, where are trends over time
[Figure: sensor plots; axis label "Time"; annotation "Zero Remaining Useful Life"]
MODEL BASED DATA ENRICHMENT
Sensor measurements appear in clusters
Corresponding to operating mode!
MODEL BASED DATA ENRICHMENT
Use H2O k-means to find cluster centers
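To make the idea concrete, here is a small plain-Python version of Lloyd's k-means algorithm. It is a teaching sketch, not H2O's implementation, and the two-dimensional sample points are invented; in the talk the clustering runs inside H2O.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: return k cluster centers for a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


# Two well-separated hypothetical groups of operating settings.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = sorted(kmeans(points, k=2))
```

With six operating modes, as in the data set described earlier, `k` would be 6.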
MODEL BASED DATA ENRICHMENT
Enrich existing data with operating mode membership
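Enrichment here means adding a column that records which cluster, and so which operating mode, each row belongs to. A minimal sketch, assuming cluster centers are already known and that membership is decided by nearest center; the centers, field names, and records are hypothetical:

```python
import math


def assign_mode(settings, centers):
    """Return the index of the center closest to this row's settings."""
    return min(range(len(centers)), key=lambda i: math.dist(settings, centers[i]))


# Hypothetical cluster centers, e.g. as found by a k-means model.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical rows with their operational settings.
records = [
    {"unit": 1, "settings": (0.2, 0.1)},
    {"unit": 1, "settings": (9.8, 10.3)},
    {"unit": 2, "settings": (10.1, 9.9)},
]

# Add the operating mode as a new field on a copy of each record.
enriched = [
    dict(rec, operating_mode=assign_mode(rec["settings"], centers))
    for rec in records
]
```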
MORE FEATURE ENGINEERING
For non-constant sensor measurements within an operating mode,
Standardize each sensor measurement by operating mode
Based on the training data
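The standardization can be sketched as follows: compute a mean and standard deviation per operating mode from the training data only, then apply those statistics to any row (training or test) according to its mode. The `(mode, value)` row layout is an assumption, as is the choice to map readings from a constant sensor to 0.0 instead of dividing by zero.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_mode_stats(train_rows):
    """Return {mode: (mean, std)} computed from (mode, value) training rows."""
    grouped = defaultdict(list)
    for mode, value in train_rows:
        grouped[mode].append(value)
    return {mode: (mean(vals), pstdev(vals)) for mode, vals in grouped.items()}


def standardize(rows, stats):
    """Standardize each value with the training statistics of its mode."""
    out = []
    for mode, value in rows:
        mu, sigma = stats[mode]
        # Assumption: a sensor that is constant within a mode maps to 0.0.
        out.append((value - mu) / sigma if sigma > 0 else 0.0)
    return out


# Hypothetical readings of one sensor: mode 0 varies, mode 1 is constant.
train = [(0, 1.0), (0, 3.0), (1, 10.0), (1, 10.0)]
stats = fit_mode_stats(train)
test_scaled = standardize([(0, 4.0), (1, 10.0)], stats)
```

Fitting the statistics on training data alone keeps information from the test engines out of the features.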
TRENDS OVER TIME!
[Figure: two panels plotted against Time: "Before H2O Munging" and "Ready for H2O Learning"]
MODELING
Configure an Estimator
MODELING
Train an Estimator
MODEL EVALUATION
Evaluate Performance at a glance in Python
MODEL EVALUATION
Evaluate Performance at a glance in H2O Flow
MODEL EVALUATION
Evaluate Performance at a glance graphically in Python
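The slides do not say which metrics are reported. For a regression target such as remaining useful life, mean squared error, its root, and mean absolute error are common choices, so the sketch below assumes those. The function name and sample numbers are illustrative only.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, and MAE for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Hypothetical remaining-useful-life values, in cycles.
actual = [10, 20, 30]
predicted = [12, 18, 30]
metrics = regression_metrics(actual, predicted)
```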
CROSS VALIDATION
Set up Hyperparameter Search Options
CROSS VALIDATION
Configure full grid search
CROSS VALIDATION
Execute grid search
CROSS VALIDATION
Evaluate results & model selection
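The four cross-validation steps above can be sketched in plain Python: define the hyperparameter options, enumerate every combination (a full grid), score each combination with k-fold cross validation, and rank the results. Everything below is an assumption made for illustration, including the toy one-parameter model; it is not the H2O grid search API.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train, valid) index lists; fold i holds out rows with i == index % k."""
    for fold in range(k):
        valid = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, valid


def full_grid_search(grid, data, fit_and_score, k=3):
    """Score every hyperparameter combination; return results sorted best first."""
    names = sorted(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [
            fit_and_score(params, [data[i] for i in tr], [data[i] for i in va])
            for tr, va in k_fold_indices(len(data), k)
        ]
        results.append((sum(scores) / len(scores), params))
    results.sort(key=lambda r: r[0])  # lower validation error is better
    return results


def fit_and_score(params, train, valid):
    """Toy model y = w * x**power with ridge penalty lam; returns validation MSE."""
    p, lam = params["power"], params["lam"]
    w = sum(x**p * y for x, y in train) / (sum(x ** (2 * p) for x, _ in train) + lam)
    return sum((w * x**p - y) ** 2 for x, y in valid) / len(valid)


# Hypothetical data generated from y = 2 * x**2.
data = [(float(x), 2.0 * x * x) for x in range(1, 10)]
grid = {"power": [1, 2], "lam": [0.0, 10.0]}  # search options
results = full_grid_search(grid, data, fit_and_score)  # execute
best_score, best_params = results[0]  # evaluate and select
```

For run-to-failure data, a fold split that keeps all rows of one engine together may be preferable to the simple row-wise split used here.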
MORE CONTROL – SCIKIT PIPELINES
Create Pipelines
Hyperparameter Options
Cross validation strategy
Hyperparameter Search Strategy
Fit
DATA PIPELINES USING H2OASSEMBLY
Typical Data Preparation
Add some structure
H2OASSEMBLY TO PRODUCTION
Java for Production Scoring
Python
MORE ENVIRONMENTS
PySparkling Water = Python + Spark + H2O
Python + Sparkling Water
COMMUNITY FEEDBACK
Pythonic Interface to H2O, R interface parity
Rapid learning and iteration
Leverage existing knowledge and skills
Interface cleanly with PyData ecosystem
More Environments, esp. PySpark
Python Pipelines to Production
RESULTS
H2O Python Framework:
H2OFrame & H2OEstimators
H2OAssembly for Data Prep Pipelines
Python, Jupyter Notebooks, Pandas, Scikit-Learn Integration
PySparkling Water
RESOURCES
• Python booklet
• Tibshirani release
• Python documentation
• GitHub examples
• Jupyter Notebook of Example
THANK YOU