Civil and Environmental Engineering Carnegie Mellon University Sensors & Knowledge Discovery (a.k.a. Data Mining) H. Scott Matthews April 14, 2003

Civil and Environmental Engineering Carnegie Mellon University

Sensors & Knowledge Discovery (a.k.a. Data Mining)

H. Scott Matthews

April 14, 2003

Recap of Last Week

Sensors - what are they?

Sensor Networks - how they help us

Sensor Signal Acquisition and Use

Effects of digital and analog conversions

Range, power, frequency, other constraints

Next - how to use the data!

Life Cycles of Sensor Networks

Currently, sensors and sensor systems are fairly proprietary

e.g. a ‘Johnson Controls’ HVAC sensor system uses only their equipment

Need to design more robust networks that are standards-driven and open

Life Cycles (2)

In addition, sensor networks tend to have very short ‘lifetimes’

i.e. We build one, use it for a few years, and then replace it with a newer/better one

Need to plan for, and design architectures for, sensor networks that will last the life of the infrastructure we are monitoring

e.g. 50-100 years for bridges (to manage LCC)

A Knowledge Discovery Framework for Civil Infrastructure Contexts

Rebecca Buchheit

Department of Civil and Environmental Engineering

Carnegie Mellon University

Motivation

• condition and usage patterns of critical infrastructure attracting increased attention

• deteriorating infrastructure + cheap data collection methods = health monitoring, transportation management, other data intensive civil infrastructure techniques

Motivation

• amount of data, relationships between attributes, context-sensitivity, observational collection methods => data mining and knowledge discovery in databases (KDD) process

• our ability to collect data far outstrips our ability to analyze and understand the data at a high level of abstraction

Databases + Statistics + Machine Learning = Data Mining

[Figure: diagram relating databases, statistics, and machine learning to data mining]

Definitions

Data Mining

• algorithms to extract patterns from large data sets

Knowledge Discovery in Databases

• “... the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” [Fayyad, et al]

• Uses observational, not controlled, data

Knowledge Discovery Process Steps

domain understanding

data understanding

data preparation

data modeling (a.k.a. “data mining”)

results evaluation

deployment

CRISP-DM

CRoss-Industry Standard Process for Data Mining

high-level, hierarchical, iterative process model for KDD

provides framework for applying KDD consistently

Domain Understanding

evaluate fit between KDD and the problem

• how much data?

• what type of data?

• perceived quality of data?

• what is being measured?

• right data to answer the question?

• organizational support?

Data Understanding

summary statistics

plotting and visualization

missing values

• randomly missing

• influenced by a measured factor

• influenced by an unmeasured factor

evaluate quality of existing data

• what is “good” data?

• what do we do with “bad” data?
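As a rough illustration of the data understanding step, the sketch below computes summary statistics and counts missing values for one sensor channel. The readings are invented for the example, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical hourly temperature readings (deg F); None marks a missing value.
readings = [68.0, 70.5, None, 71.0, 69.5, None, 250.0, 70.0]

# Separate the values that are present from the missing ones.
present = [r for r in readings if r is not None]
missing_count = len(readings) - len(present)

summary = {
    "count": len(present),
    "missing": missing_count,
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "min": min(present),
    "max": max(present),
}

for name, value in summary.items():
    print(name, value)
```

A mean far above the median, as here, hints at a suspicious reading (the 250.0) that deserves a closer look before deciding whether it is "bad" data.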

Data Preparation

most time-consuming part of KDD

data selection

• which records (“rows”) to use

• which attributes (“columns”) to use

data cleaning

• do something to bad and missing data

integrate data from different sources

transform data
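The four preparation activities can be sketched on a toy example. Everything here is hypothetical: the field names (`hour`, `energy_kj`, `meter_id`), the two data sources, and the choice to drop rows with missing values rather than impute them.

```python
# Hypothetical energy records (source 1) and weather lookup (source 2).
energy = [
    {"hour": 1, "energy_kj": 52000.0, "meter_id": "A"},
    {"hour": 2, "energy_kj": None, "meter_id": "A"},
    {"hour": 3, "energy_kj": 61000.0, "meter_id": "A"},
]
weather = {1: 25.0, 2: 27.0, 3: 31.0}  # hour -> outside temperature (deg F)

prepared = []
for row in energy:
    # Cleaning: one simple policy is to drop rows with a missing measurement.
    if row["energy_kj"] is None:
        continue
    hour = row["hour"]
    # Selection (rows): keep only hours that also appear in the weather source.
    if hour not in weather:
        continue
    prepared.append({
        "hour": hour,
        "temp_f": weather[hour],                 # integration of the two sources
        "energy_mj": row["energy_kj"] / 1000.0,  # transformation: kJ -> MJ
        # Selection (columns): meter_id is deliberately left out.
    })

print(prepared)
```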

Data Modeling/Data Mining

choose an algorithm

• choose parameters for that algorithm

• apply algorithm to data

• evaluate results

– predictive accuracy

– descriptive coverage

• repeat as necessary

repeat as necessary
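The choose/apply/evaluate/repeat loop can be sketched with a deliberately simple algorithm. The threshold classifier, the vibration data, and the candidate parameter values below are all assumptions made for illustration, not the method used in the talk.

```python
# Hypothetical labeled data: (vibration amplitude, 1 if damaged else 0).
train = [(0.2, 0), (0.4, 0), (0.5, 0), (0.9, 1), (1.1, 1), (1.3, 1)]
holdout = [(0.3, 0), (0.6, 0), (1.0, 1), (1.2, 1)]

def predict(threshold, x):
    """Chosen algorithm: label an item as damaged when x reaches the threshold."""
    return 1 if x >= threshold else 0

def accuracy(threshold, data):
    """Predictive accuracy: fraction of items whose label is predicted correctly."""
    correct = sum(1 for x, y in data if predict(threshold, x) == y)
    return correct / len(data)

# Repeat as necessary: try several parameter values, keep the best on training data.
candidates = [0.3, 0.5, 0.7, 0.9, 1.1]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Evaluate the chosen parameter on data that was not used to choose it.
holdout_accuracy = accuracy(best_threshold, holdout)
print(best_threshold, holdout_accuracy)
```

The outer "repeat as necessary" would wrap this whole block in a comparison across different algorithms.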

Data Mining Goals

Prediction

• predict the value of one or more variables based on the values of other variables

Description

• describe the data set in a compact, human-understandable form

Data Mining Tasks

• Classification

• Regression

• Clustering

• Deviation detection

• Summarization

• Dependency modeling

Classification

learn how to classify data items into predefined groups
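One of many possible classifiers is a nearest-centroid rule, sketched below. The two features, the labels "good" and "poor", and the training items are hypothetical.

```python
import math

# Hypothetical training items: (feature vector, predefined group label).
training = [
    ((1.0, 1.0), "good"), ((1.2, 0.8), "good"),
    ((4.0, 4.2), "poor"), ((3.8, 4.0), "poor"),
]

def centroids(items):
    """Learn one centroid (mean feature vector) per predefined group."""
    sums, counts = {}, {}
    for (x, y), label in items:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def classify(point, cents):
    """Assign a new item to the group whose centroid is closest."""
    return min(cents, key=lambda label: math.dist(point, cents[label]))

cents = centroids(training)
print(classify((1.1, 1.0), cents))
print(classify((3.5, 3.9), cents))
```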

Regression

learn a function that maps one or more independent variables to a real-valued dependent variable
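The simplest case, one independent variable fitted by ordinary least squares, can be sketched as follows. The data points are made up and lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted value for a new x
print(slope, intercept, prediction)
```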

Clustering

learn “natural” classes or clusters of data
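Unlike classification, clustering starts without predefined groups. A minimal sketch using k-means on one-dimensional data is shown below; k-means is only one possible choice, and the values, the number of clusters, and the starting centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data: assign to nearest center, then recompute centers."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical readings that fall into two visibly separate clumps.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
print(centers, groups)
```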

Deviation Detection

detect changes or deviations from “normal” or baseline state
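A common, simple way to flag deviations is to compare new readings against the mean and standard deviation of a baseline period. The sketch below assumes a z-score threshold of 3; the baseline and new readings are invented.

```python
import statistics

# Hypothetical readings collected while the structure was in its "normal" state.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0]
new_readings = [10.1, 9.7, 12.5, 10.0]

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)  # sample standard deviation

def is_deviation(value, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations from baseline."""
    return abs(value - mean) / stdev > threshold

flagged = [v for v in new_readings if is_deviation(v)]
print(flagged)
```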

Summarization

summarize subsets of data set

computer industry: mean salary = $65k

service industry: mean salary = $20k
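Summaries like these are group-wise aggregates. The sketch below uses made-up salary records, chosen only so that the group means match the figures on the slide.

```python
from collections import defaultdict

# Hypothetical records: (industry, salary in $k).
records = [
    ("computer", 60), ("computer", 70),
    ("service", 18), ("service", 22),
]

# Accumulate a running [sum, count] per industry.
totals = defaultdict(lambda: [0, 0])
for industry, salary_k in records:
    totals[industry][0] += salary_k
    totals[industry][1] += 1

mean_salary = {industry: s / n for industry, (s, n) in totals.items()}
print(mean_salary)
```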

Dependency Modeling

learn relationships between attributes or between items in the data set

• pattern recognition

• time series analysis

• association rules

In 80% of the cases, an engineer with a PE and 10 years experience is a project manager.
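The "80% of the cases" figure is what association rule mining calls the confidence of a rule. The sketch below computes support and confidence for the rule "PE and 10 years experience => project manager" on invented records; reading "10 years experience" as "at least 10 years" is an assumption.

```python
# Hypothetical engineer records.
engineers = [
    {"pe": True, "years": 12, "role": "project manager"},
    {"pe": True, "years": 10, "role": "project manager"},
    {"pe": True, "years": 15, "role": "project manager"},
    {"pe": True, "years": 11, "role": "project manager"},
    {"pe": True, "years": 20, "role": "designer"},
    {"pe": False, "years": 12, "role": "designer"},
    {"pe": True, "years": 3, "role": "designer"},
    {"pe": False, "years": 2, "role": "inspector"},
]

def antecedent(e):
    """Left-hand side of the rule: has a PE and at least 10 years of experience."""
    return e["pe"] and e["years"] >= 10

def consequent(e):
    """Right-hand side of the rule: is a project manager."""
    return e["role"] == "project manager"

matching = [e for e in engineers if antecedent(e)]
both = [e for e in matching if consequent(e)]

support = len(both) / len(engineers)    # share of all records satisfying both sides
confidence = len(both) / len(matching)  # share of antecedent matches that are PMs
print(support, confidence)
```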

Data Mining in the IW

concept description using classification

environmental conditions affect hot water energy consumption

• used outside temperature, solar radiation and wind speed

• solar radiation and wind speed not significant above 80F and below 50F

• IF temperature between 20F and 30F THEN energy usage between 47,393 kJ and 131,875 kJ

• describes >50% instances in energy usage range
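How well an IF-THEN rule of this kind describes the data can be checked by counting instances. The sketch below uses the temperature and energy bounds from the rule above, but the observations are invented, and interpreting "describes >50% instances in energy usage range" as coverage (the share of instances in the energy range that also meet the temperature condition) is an assumption.

```python
TEMP_RANGE = (20.0, 30.0)           # deg F, from the rule
ENERGY_RANGE = (47393.0, 131875.0)  # kJ, from the rule

# Hypothetical (outside temperature in deg F, energy usage in kJ) observations.
observations = [
    (22.0, 50000.0), (25.0, 90000.0), (28.0, 120000.0),
    (35.0, 60000.0), (45.0, 30000.0), (27.0, 140000.0),
]

def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high

in_energy = [o for o in observations if in_range(o[1], ENERGY_RANGE)]
in_temp = [o for o in observations if in_range(o[0], TEMP_RANGE)]
both = [o for o in in_energy if in_range(o[0], TEMP_RANGE)]

coverage = len(both) / len(in_energy)  # share of the energy class the rule describes
accuracy = len(both) / len(in_temp)    # share of rule firings that are correct
print(coverage, accuracy)
```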

Results Evaluation

do results meet client’s criteria?

novel?

understandable?

valid (modeling phase)?

useful?

Results Deployment

explain results to client

improvements to data collection?

ongoing process applied to new data?

Benefits of KDD

Intelligent Workplace

• confirmation that system is (not) working

• continue to monitor control system

• in future, predict missing values to complete energy studies

Apply Data Mining to Civil Infrastructure?

• civil infrastructure meets guidelines for selecting potential data mining problems

  • significant impact

  • no good alternatives exist

  • prior/domain knowledge

  • effects of noisy data are mitigated

  • sufficient data

  • relevant attributes are being measured

Background

• sporadic use of KDD techniques in civil infrastructure

• relative youth of data mining research

• difficult to systematically apply KDD process

• KDD process tools (CRISP-DM) still under development

• KDD process highly domain dependent

• time consuming to teach data mining analysts domain knowledge

Research Objectives

• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs

  • set of guidelines for inexperienced analysts

  • checklist for more experienced analysts

• describe intersection of KDD process characteristics and civil infrastructure

  • what problems are well-suited to KDD?

  • what characteristics are unique to infrastructure?

Summary

• increased data collection => increased need to intelligently analyze data

• KDD process as a “power tool” for analyzing data for high-level knowledge

• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results

• proposed framework will help researchers to systematically apply KDD process to their data analysis problems