Transcript
Page 1: Want Awesome Models? Build Awesome Training Data!

©2016 RapidMiner, Inc. All rights reserved. - 1 -

Expertly Prepared Data Produces Better Models, Faster!

Tom Ott, Marketing Data Scientist

RapidMiner

Page 2: Want Awesome Models? Build Awesome Training Data!

Today’s Agenda
• Introduction
• Challenges of Dirty Data
• Data Prep Overview
  – Data Exploration
  – Data Blending
  – Data Cleansing
• Demo
• Q&A

Page 3: Want Awesome Models? Build Awesome Training Data!

Unified Platform Accelerates Time to Value
• Incorporate all types of data
• Data Prep: Speed & optimize ALL data exploration, blending & cleansing tasks
• Model & Validate: Apply machine learning to rapidly prototype & confidently validate predictive models
• Operationalize: Easily deploy & maintain models and embed analytic results
• Embed results in all types of business apps & data visualization tools

Page 4: Want Awesome Models? Build Awesome Training Data!

Data in the Real World is… Dirty
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=“”
• Noisy: containing errors or outliers
  – Salary=“-10”, Age=“222”
• Inconsistent: containing discrepancies in codes or names
  – e.g., Age=“42” Birthday=“03/07/1997”
  – e.g., Was rating “1,2,3”, now rating “A, B, C”
  – e.g., discrepancy between duplicate records
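
The three kinds of dirt listed on this slide can be checked mechanically. What follows is a minimal Python sketch, not RapidMiner's implementation. The field names, the plausible ranges (age 0 to 120, non-negative salary), the fixed reference date, and the MM/DD/YYYY reading of the birthday are all assumptions chosen to mirror the slide's examples.

```python
from datetime import date, datetime

# Hypothetical record mirroring the slide's examples; field names are assumptions.
record = {"occupation": "", "salary": "-10", "age": "222", "birthday": "03/07/1997"}

def find_issues(rec, today=date(2016, 6, 1)):
    """Return a list of (kind, message) pairs describing problems in one record."""
    issues = []

    # Incomplete: attribute present but empty.
    for field, value in rec.items():
        if value is None or str(value).strip() == "":
            issues.append(("incomplete", f"{field} is missing"))

    # Noisy: values outside an assumed plausible range.
    salary = float(rec["salary"])
    if salary < 0:
        issues.append(("noisy", f"salary {salary:g} is negative"))
    age = int(rec["age"])
    if not 0 <= age <= 120:
        issues.append(("noisy", f"age {age} is outside 0..120"))

    # Inconsistent: stated age disagrees with the age implied by the birthday
    # (birthday assumed to be MM/DD/YYYY).
    born = datetime.strptime(rec["birthday"], "%m/%d/%Y").date()
    implied = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    if implied != age:
        issues.append(("inconsistent", f"age {age} but birthday implies {implied}"))

    return issues

issues = find_issues(record)
for kind, message in issues:
    print(f"{kind}: {message}")
```

Checks like these only flag problems; deciding whether to drop, correct, or impute a flagged value is a separate step.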

Page 5: Want Awesome Models? Build Awesome Training Data!

Time Consuming
• Every real-world dataset needs some kind of data pre-processing
  – Deal with missing values
  – Correct erroneous values
  – Select relevant attributes
  – Adapt data set format to the model type
• In general, data prep or pre-processing consumes more than 60% of the effort in a data science project
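
The four pre-processing steps on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the method used in the demo: the rows, column names, and valid ranges are made up, and median imputation and one-hot encoding are techniques picked here for the example because the slide does not name specific ones.

```python
from statistics import median

# Hypothetical raw rows; column names and values are made up for illustration.
rows = [
    {"id": 1, "age": 34, "salary": 52000, "rating": "A"},
    {"id": 2, "age": None, "salary": 48000, "rating": "B"},
    {"id": 3, "age": 222, "salary": 61000, "rating": "A"},
    {"id": 4, "age": 45, "salary": -10, "rating": "C"},
]

# Assumed plausible ranges; values outside them are treated as erroneous.
VALID = {"age": (0, 120), "salary": (0, 1_000_000)}

def preprocess(rows, features=("age", "salary", "rating")):
    # 1. Correct erroneous values: blank out anything outside its valid range.
    cleaned = [dict(r) for r in rows]  # copy so the input rows are not mutated
    for r in cleaned:
        for col, (lo, hi) in VALID.items():
            if r[col] is not None and not lo <= r[col] <= hi:
                r[col] = None

    # 2. Deal with missing values: impute each numeric column with its median.
    for col in VALID:
        fill = median(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill

    # 3. Select relevant attributes: keep only the requested features (drops "id").
    # 4. Adapt format to the model type: one-hot encode the categorical column
    #    so every row becomes a purely numeric vector.
    levels = sorted({r["rating"] for r in cleaned})
    matrix = []
    for r in cleaned:
        vector = [r[f] for f in features if f != "rating"]
        if "rating" in features:
            vector += [1 if r["rating"] == lvl else 0 for lvl in levels]
        matrix.append(vector)
    return matrix

X = preprocess(rows)
for vector in X:
    print(vector)
```

Treating out-of-range values as missing before imputing keeps the bad values from skewing the fill value, which is why step 1 runs before step 2 in this sketch.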

Page 6: Want Awesome Models? Build Awesome Training Data!

Reduces Model Accuracy & Performance

Page 7: Want Awesome Models? Build Awesome Training Data!

It’s Time to Wrangle Some Data!
• Data Exploration
  – Discovery through Stats, Charts and Graphs
• Data Blending
  – Attribute Selection & Generation
  – Data Types & Conversions
  – Filters, Sorts & Joins
  – Sampling
• Data Cleansing
  – Missing Values
  – Transformation - Normalization
  – Outliers
  – Feature Selection
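
A few of the items on this slide are easy to show outside of any particular tool. The sketch below is plain Python rather than a RapidMiner process, and it covers only summary statistics, a join with a filter and sort, outlier flagging, and normalization. The tables and values are hypothetical, and the 1.5 × IQR outlier rule and min-max scaling are assumed choices, since the slide does not say which methods the demo uses.

```python
from statistics import mean, quantiles

# Hypothetical tables; names and values are illustrative only.
customers = [
    {"id": 1, "region": "East"},
    {"id": 2, "region": "West"},
    {"id": 3, "region": "East"},
    {"id": 4, "region": "West"},
    {"id": 5, "region": "East"},
]
spend = {1: 120.0, 2: 135.0, 3: 110.0, 4: 990.0, 5: 125.0}

# Data exploration: simple summary statistics.
values = list(spend.values())
summary = {"min": min(values), "max": max(values), "mean": mean(values)}

# Data blending: inner join on id, then filter and sort.
joined = [{**c, "spend": spend[c["id"]]} for c in customers if c["id"] in spend]
east = sorted((r for r in joined if r["region"] == "East"), key=lambda r: r["spend"])

# Data cleansing: flag outliers with the 1.5 * IQR rule (an assumed choice).
q1, _, q3 = quantiles(values, n=4, method="inclusive")
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [r["id"] for r in joined if not low <= r["spend"] <= high]

# Data cleansing: min-max normalization of the remaining rows to [0, 1].
kept = [r for r in joined if r["id"] not in outliers]
lo, hi = min(r["spend"] for r in kept), max(r["spend"] for r in kept)
for r in kept:
    r["spend_norm"] = (r["spend"] - lo) / (hi - lo)

print(summary)
print("outlier ids:", outliers)
```

Outliers are removed before normalizing here because a single extreme value would otherwise compress every other value toward zero under min-max scaling.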

Page 8: Want Awesome Models? Build Awesome Training Data!

Demonstration

Page 9: Want Awesome Models? Build Awesome Training Data!

Next Steps
• Resources
  – RapidMiner Blog: rapidminer.com/resources/blog/
  – RapidMiner Community: community.rapidminer.com
• On-Demand Demos
  – Advanced Data Prep: rapidminer.com/resource/advanced-data-prep/
  – Data Prep Subprocess: rapidminer.com/resource/creating-data-prep-Subprocess
• Training Videos
  – Data Exploration: rapidminer.com/training/videos/
  – Data Prep: rapidminer.com/training/videos/

Page 10: Want Awesome Models? Build Awesome Training Data!

Contact Us: [email protected], @RapidMiner, www.rapidminer.com

Q & A

Discuss Data Prep in the Community

community.rapidminer.com