Data Mining Techniques in Support of Science Data Stewardship Eric A. Kihn, M. Zhizhin NOAA/NGDC RAS/CGDS



Presentation outline

I. Background for the talk
II. What is science data stewardship?
III. What is data mining?
IV. Techniques for SDS
V. Conclusions


Motivation for this presentation?

• Present an innovative technology application to a new community

• Show different methods for accessing and characterizing data

• Present some interesting results in the area of employing intelligent systems to support environmental data archives


What is being presented

• A set of tools and techniques developed or utilized at the National Geophysical Data Center

• A system meant to mimic the expertise of a subject matter expert (SME)

• Some key concepts such as fuzzy logic, data mining, knowledge tools.


Nature, June 10, 1999

• “It’s sink or swim as a tidal wave of data approaches. … Are scientists ready for the flood?”

• “Most researchers are accustomed to studying a relatively small data set for a long time, using statistical models to tease out patterns. At some fundamental level that paradigm has broken down.”

• NASA’s EOS exceeds 1 Tb/Day

• CERN exceeds 20 Tb/Day

• The internet as a distributed data source provides 100’s of petabytes.


Ph.D’s and Networked Data

• The number of eyes looking at data remains constant.

• The amount of data tends to follow Moore’s law.

• In order to turn data into knowledge new techniques are required.

Nature June 10, 1999


NOAA NATIONAL DATA CENTERS


NGDC Holdings - % Mbytes by Data Type

Data archived as of September 2002

[Pie charts; recoverable labels listed below]

First chart:
• DMSP satellite data: 97%
• All other: 3%

Second chart:
• Satellite - GOES, NOAA TIROS: 24%
• Side scan sonar: 16%
• Ionospheric: 12%
• Geomagnetism: 10%
• Marine trackline + other marine: 10%
• Ecosystems: 9%
• Bathymetry, topography, & relief: 8%
• Solar: 7%
• Hazards: 2%
• Marine geology: 1%
• Land gravity: < 1%
• Land geochemistry: < 1%
• Cosmic ray: < 1%
• Aurora: < 1%
• Land geothermal: < 1%
• Solar-terrestrial publications: < 1%


What is science data stewardship?


Why the emphasis on data mining now?

Answer: Layers of data archives

[Diagram: layers labeled Data Collection, Data Archive, Data Warehouse, and Data Mart, annotated with:]

• Standard Metadata

• Access methods (e.g. XML)

• Enterprise organization of data

• Data quality control

• Local holdings


Levels of Information Analysis

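[Diagram: analysis techniques arranged along a "Better Information" axis, grouped under "OLAP & Reporting" and "Advanced Analysis". Techniques as listed on the slide:]

• Simulation/Optimization
• Forecasting
• Segmentation
• Model Building
• Hypothesis Testing
• Statistics
• Conditional Climatology
• Visualization
• Climatology
• Percentages
• Counts & Sums
• Queries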


[Diagram; recoverable labels: Raw Data; Collection and Storage; Processing, Calibration; Quality Control; Techniques; Quality Data; Productization; Research; Analysis; Knowledge; Society; Users; Skilled Users; Mission Scientists; Scientists; User Requirements]


What is data mining?


Definition of Data Mining

Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1]. It uses machine learning, statistical, and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans.


Application to Environmental Data

• Data quality control

• Human linguistic translation

• Event and trend detection

• Data classification

• Forecast

• Deviation detection


Categories of Knowledge Tools

• Reporting and OLAP

• Theory driven modeling:
  – Correlations
  – t-tests
  – ANOVA
  – Linear Regression
  – Logistic Regression
  – Discriminant Analysis
  – Forecasting Methods

• Data driven modeling:
  – Cluster Analysis
  – Factor Analysis
  – Decision Trees
  – Fuzzy Classifier
  – Neural Networks
  – Association rules
  – Rule induction

[Figure: 2-D Fuzzy C-Means Clustering]


Why Fuzzy Logic?

Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth: truth values between "completely true" and "completely false". It was introduced by Dr. Lotfi Zadeh of UC Berkeley in the 1960s as a means to model the uncertainty of natural language.

Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 and 1. Fuzzy logic lets computers function closer to the way our brains work. We aggregate data and form a number of partial truths, which we aggregate further into higher truths; these in turn, when certain thresholds are exceeded, cause further results such as a motor reaction.


Fuzzy-Logic

• Jim is 5’2” (157 cm) tall. Is Jim tall?
  – Boolean Logic - “NO” (0)
  – Fuzzy-Logic - “Jim is .082 tall” (.082)

• Major Advantages:
  – Allows more realistic (natural) definition of sets
  – More graceful handling of boundaries/intersections
  – Provides more human-like searching

• Fuzzy-Logic does NOT impact the data. It is simply a classification technique for selecting the most relevant data, given a set of complex conditions.
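
As a rough illustration of the "Is Jim tall?" example, the sketch below contrasts a crisp threshold with a graded membership. The 180 cm Boolean cutoff and the 150 cm to 190 cm ramp are assumptions chosen for illustration; the slide does not state the membership function behind its .082 value, so the grade computed here differs from it.

```python
def boolean_tall(height_cm, cutoff_cm=180.0):
    """Crisp set: 1 at or above the cutoff, else 0 (cutoff is an assumed value)."""
    return 1 if height_cm >= cutoff_cm else 0


def fuzzy_tall(height_cm, low_cm=150.0, high_cm=190.0):
    """Fuzzy set: linear ramp from 0 at low_cm to 1 at high_cm (assumed breakpoints)."""
    grade = (height_cm - low_cm) / (high_cm - low_cm)
    return min(1.0, max(0.0, grade))


if __name__ == "__main__":
    # Heights in cm; 157 cm corresponds to the slide's example.
    for height in (157.0, 175.0, 195.0):
        print(height, boolean_tall(height), round(fuzzy_tall(height), 3))
```

The data values are never modified; only the grade attached to each value changes with the choice of membership function.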


Definition of a fuzzy set

Fuzzy set A in X is a set of ordered pairs

$A = \{(x, \mu_A(x)) \mid x \in X\}$

defined by membership function $0 \le \mu_A(x) \le 1$

Classical set A in X is a set of ordered pairs

$A = \{(x, I_A(x)) \mid x \in X\}$

defined by indicator function $I_A(x) \in \{0, 1\}$


Fuzzy logic

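First operand: fuzzy set A

Second operand: fuzzy set B

Fuzzy NOT: $\mu_{\bar{A}}(x) = 1 - \mu_A(x)$

Fuzzy AND: $\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$

Fuzzy OR: $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$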
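
The three operators on this slide (complement for NOT, minimum for AND, maximum for OR) can be sketched on lists of membership grades. The grades for A and B below are made-up values for four elements, not data from the talk.

```python
def fuzzy_not(a):
    """Complement: one minus each membership grade."""
    return [1.0 - x for x in a]


def fuzzy_and(a, b):
    """Intersection: element-wise minimum of the two grades."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_or(a, b):
    """Union: element-wise maximum of the two grades."""
    return [max(x, y) for x, y in zip(a, b)]


# Hypothetical membership grades of four elements in fuzzy sets A and B.
A = [0.0, 0.25, 0.5, 1.0]
B = [1.0, 0.75, 0.5, 0.0]

print(fuzzy_not(A))
print(fuzzy_and(A, B))
print(fuzzy_or(A, B))
```

When every grade is exactly 0 or 1, these operators give the same results as Boolean NOT, AND, and OR, which is the sense in which fuzzy logic extends conventional logic.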


[Figure: three time series for 1/1/97 to 1/31/97, each labeled with a fuzzy search term:]

• January Wind Speed Record: Wind Speed (kts) vs. Date, “High” Wind

• January Temperature Record: Temperature (deg C) vs. Date, “Average” Temperature

• January Relative Humidity Record: Rel. Humidity (%) vs. Date, “About” 60% Humidity
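
A search for "high" wind, "average" temperature, and "about" 60% humidity could be sketched as follows. Everything numeric here is an assumption made for illustration: the four records are synthetic, and the ramp and triangle breakpoints are hypothetical choices, not the membership functions used in the talk. Each record is scored with the fuzzy AND (minimum) of its three grades and the records are ranked by that score.

```python
def ramp_up(x, lo, hi):
    """Grade 0 below lo, 1 above hi, linear in between."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def triangle(x, center, half_width):
    """Grade 1 at center, falling linearly to 0 at center +/- half_width."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Synthetic records: (day, wind in kts, temperature in deg C, relative humidity in %).
records = [
    (1, 4.0, 12.0, 80.0),
    (2, 14.0, 15.0, 62.0),
    (3, 18.0, 27.0, 30.0),
    (4, 11.0, 16.0, 55.0),
]

# "Average" is taken relative to the mean of the record itself (an assumption).
mean_temp = sum(r[2] for r in records) / len(records)


def score(record):
    """Fuzzy AND of the three conditions for one record."""
    _, wind, temp, humidity = record
    high_wind = ramp_up(wind, 5.0, 15.0)
    average_temp = triangle(temp, mean_temp, 10.0)
    about_60_humidity = triangle(humidity, 60.0, 20.0)
    return min(high_wind, average_temp, about_60_humidity)


ranked = sorted(records, key=score, reverse=True)
for rec in ranked:
    print("day", rec[0], "score", round(score(rec), 3))
```

Ranking by score rather than filtering with hard thresholds is what lets near misses, such as a day with 55% humidity, still appear in the result list.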


List of Events


What is fuzzy clustering?

• In non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly one cluster.

• In fuzzy clustering, the data points can belong to more than one cluster, and associated with each of the points are membership grades which indicate the degree to which the data points belong to the different clusters.
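
To make membership grades concrete, here is a minimal sketch of the standard fuzzy c-means iteration in plain Python. The data points, the initial centers, the fuzzifier m = 2, and the fixed iteration count are all assumptions for illustration; a production implementation would also test for convergence.

```python
import math


def memberships_for(points, centers, m):
    """Grades of every point in every cluster; each row sums to 1."""
    rows = []
    for p in points:
        # Clip distances so a point sitting exactly on a center does not divide by zero.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        rows.append([
            1.0 / sum((d_i / d_j) ** (2.0 / (m - 1.0)) for d_j in dists)
            for d_i in dists
        ])
    return rows


def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate membership and center updates; return (centers, memberships)."""
    centers = [tuple(c) for c in initial_centers]
    dim = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        centers = []
        for i in range(len(initial_centers)):
            # New center: mean of all points weighted by membership ** m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
    return centers, memberships_for(points, centers, m)


# Synthetic 2-D points: two tight groups plus one point midway between them.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
          (0.5, 0.5)]
centers, u = fuzzy_c_means(points, initial_centers=[(0.2, 0.2), (0.8, 0.8)])
for p, row in zip(points, u):
    print(p, [round(g, 2) for g in row])
```

Points inside a group receive a grade close to 1 for that group, while the midway point is shared about equally between the two clusters, which a hard clustering could not express.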


Types of Fuzzy Cluster Algorithms

Classical Fuzzy Algorithms (cumulus-like clusters)

• The fuzzy c-means algorithm
• The Gustafson-Kessel algorithm
• The Gath-Geva algorithm
• Mountain and Subtractive

Linear and Ellipsoidal (lines)

• The fuzzy c-varieties algorithm
• The adaptive clustering algorithm

Shell (circles, ellipses, parabolas)

• Fuzzy c-shells algorithm
• Fuzzy c-spherical algorithm
• Adaptive fuzzy c-shells algorithm


Mountain fuzzy clustering algorithm

• Form a grid on the data space; intersections are candidates for cluster centers

• Construct mountain function representing data density (each data point contributes to the height):

$m(v) = \sum_{i=1}^{N} \exp\left(-\frac{\|v - x_i\|^2}{2\sigma^2}\right)$

• Sequentially destruct the mountain function:
  – Make dent where highest values are

$m_{new}(v) = m(v) - m(c_1) \exp\left(-\frac{\|v - c_1\|^2}{2\beta^2}\right)$

• The subtracted amount decreases with the distance between v and c1 and is proportional to the height m(c1)
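
The grid, mountain-building, and destruction steps can be sketched in a few lines of Python. This is a sketch under stated assumptions: the data are synthetic points on the unit square, each grid node receives a Gaussian bump of width sigma from every data point, destruction uses a Gaussian of width beta around the chosen center, and the `stop_ratio` and `max_clusters` stopping rules are hypothetical additions, not taken from the slides.

```python
import math


def mountain_clustering(points, grid_steps=11, sigma=0.1, beta=0.1,
                        stop_ratio=0.2, max_clusters=10):
    """Pick cluster centers from a regular grid on the unit square."""
    axis = [i / (grid_steps - 1) for i in range(grid_steps)]
    grid = [(gx, gy) for gx in axis for gy in axis]

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Mountain function: every data point adds a Gaussian bump at every grid node.
    height = [
        sum(math.exp(-sq_dist(v, x) / (2.0 * sigma ** 2)) for x in points)
        for v in grid
    ]

    centers = []
    first_peak = None
    while len(centers) < max_clusters:
        # The highest remaining grid node is the next candidate center.
        peak = max(range(len(grid)), key=lambda j: height[j])
        c, m_c = grid[peak], height[peak]
        if first_peak is None:
            first_peak = m_c
        elif m_c < stop_ratio * first_peak:
            break  # remaining peaks are small compared with the first one
        centers.append(c)
        # Destruct the mountain around the chosen center.
        height = [
            h - m_c * math.exp(-sq_dist(v, c) / (2.0 * beta ** 2))
            for v, h in zip(grid, height)
        ]
    return centers


# Synthetic data: four points near (0.2, 0.2) and three near (0.8, 0.8).
data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21), (0.19, 0.2),
        (0.8, 0.8), (0.81, 0.79), (0.79, 0.82)]
found = mountain_clustering(data)
print(found)
```

The cost grows with the number of grid nodes times the number of data points, and the number of nodes grows quickly with the dimension of the data space.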


2D density mountains

[Figure: four panels (a) to (d) on the unit square. Caption: Mountain function with b) σ=0.05 c) 0.1 d) 0.2]


2D mountain clustering

[Figure: four surface panels (a) to (d). Caption: Mountain destruction with β=1 b) first cluster c) second d) third]


Mountain fuzzy clustering

• No need to set number of clusters a priori

• Simple, but computationally expensive

• May be used to generate fuzzy rules relating the variables (knowledge discovery)

• May be generalized to subtractive clustering

Yager, R. and D. Filev, "Generation of Fuzzy Rules by Mountain Clustering," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, pp. 209-219, 1994.


Subtractive clustering

The method assumes each data point is a potential cluster center and calculates a measure of the likelihood that each data point would define the cluster center, based on the density of surrounding data points:

• Selects the data point with the highest potential to be the first cluster center

• Removes all data points in the vicinity of the first cluster center, in order to determine the next data cluster and its center location

• Iterates on this process until all of the data is within radii of a cluster center
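
A minimal sketch of these steps follows. It uses the potential-based formulation commonly associated with subtractive clustering: instead of deleting nearby points outright, it lowers their potential so they are unlikely to be chosen again. The radius, the `squash` factor of 1.5, and the single `stop_ratio` stopping rule are assumptions for illustration, and the stopping rule is a simplification. The data are synthetic and assumed to be already scaled to the unit square.

```python
import math


def subtractive_clustering(points, radius=0.3, squash=1.5, stop_ratio=0.15):
    """Return cluster centers chosen from among the data points themselves."""
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Potential of each point: large when many other points lie within the radius.
    potential = [
        sum(math.exp(-alpha * sq_dist(p, q)) for q in points) for p in points
    ]
    first_peak = max(potential)

    centers = []
    while True:
        k = max(range(len(points)), key=lambda i: potential[i])
        p_k = potential[k]
        if p_k < stop_ratio * first_peak:
            break  # no remaining point has enough potential to be a center
        centers.append(points[k])
        # Lower the potential near the new center; the center itself drops to zero.
        potential = [
            p - p_k * math.exp(-beta * sq_dist(x, points[k]))
            for x, p in zip(points, potential)
        ]
    return centers


# Synthetic data: four points near (0.2, 0.2) and three near (0.8, 0.8).
data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21), (0.19, 0.2),
        (0.8, 0.8), (0.81, 0.79), (0.79, 0.82)]
found = subtractive_clustering(data)
print(found)
```

Because candidates are the observations rather than grid nodes, the work scales with the square of the number of points instead of with a grid that grows with dimension.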


Subtractive clustering advantages

• No grids in the parameter space: computationally efficient

• Fuzzy clusters centered at the observation points: real modes selection

• May be used to generate fuzzy rules relating the variables (knowledge discovery)

Chiu, S., "Fuzzy Model Identification Based on Cluster Estimation," Journal of Intelligent & Fuzzy Systems, Vol. 2, No. 3, Sept. 1994


Techniques for SDS


Data Quality Control

The Space Weather Reanalysis (SWR), a long-term reanalysis, requires careful quality control of a huge volume of data. A single instance of bad data can have ripple effects throughout the entire model run. Working with satellite and station data in particular can be tricky, with spikes, baseline shifts, and dropouts all prominent in the data stream. In a typical small-scale study a researcher could hand-screen the data, but here the volume requires “intelligent” computer techniques based on fuzzy logic, neural computing, and other mathematical functions.

Sample station data used in the SWR effort.
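
As one hedged illustration of automated screening, the sketch below assigns each sample a fuzzy grade for "this looks like a spike" from its deviation off a running median, and passes dropouts through as missing. It is not the SWR procedure: the window length, the `scale` that maps deviation to a grade, and the sample series are all hypothetical, and baseline shifts are not handled.

```python
import statistics


def spike_membership(series, window=5, scale=3.0):
    """Grade in [0, 1] per sample; None marks a dropout (missing value)."""
    half = window // 2
    grades = []
    for i, value in enumerate(series):
        if value is None:
            grades.append(None)  # dropout: nothing to grade
            continue
        # Neighbours inside the window, ignoring dropouts.
        neighbours = [v for v in series[max(0, i - half): i + half + 1]
                      if v is not None]
        deviation = abs(value - statistics.median(neighbours))
        # Linear ramp: deviation of `scale` or more counts fully as a spike.
        grades.append(min(1.0, deviation / scale))
    return grades


# Synthetic station record with one spike (25.0) and one dropout (None).
record = [10.0, 10.2, 9.9, 25.0, 10.1, None, 10.0, 9.8]
grades = spike_membership(record)
for value, grade in zip(record, grades):
    print(value, None if grade is None else round(grade, 3))
```

A graded flag lets borderline samples be reviewed or down-weighted rather than silently kept or discarded.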


Some Preliminary Results

Scenario: Boulder for Mid October

Parameters Studied: Temperature (surface), Relative Humidity

Impacts: Scenario represents likely impacts on an IR sensor instrument.

Data Source: NCEP Reanalysis (20 years)

Technique: Subtractive Clustering


Visual Standard Output


Boulder 00 UT Rh vs. T

[Figure: scatter plot at 00:00; x axis Temp, K (265 to 300); y axis Rel Humid, % (0 to 100)]


Boulder Rh vs. T 20 Year Composite

[Figure: 3-D plot; axes RelHumid, %; Temp, K; Day time, h]


Boulder Derived Modes (subtractive clustering)

[Figure: 3-D plot; axes RelHumid, %; Temp, K; Day time, h]


Conclusions

• Increasing data volumes demand new tools and methods

• Mathematical methods exist that provide analysis, classification, and forecasting for large data volumes

• Fuzzy based systems hold great promise as knowledge extraction tools.


Resources

Books

• Fuzzy Cluster Analysis, Hoppner et al.
• Neuro-Fuzzy and Soft Computing, Jang et al.
• Fuzzy Logic, Yen
• System Identification, Ljung

Web

• The Environmental Scenario Generator: http://esg.ngdc.noaa.gov
• Data Mining and Knowledge Discovery: http://www.digimine.com/usama/datamine/

An excellent introductory article is: Bezdek, James C, "Fuzzy Models --- What Are They, and Why?", IEEE Transactions on Fuzzy Systems, 1:1, pp. 1-6, 1993.