kamath 11
Scientific Data Mining
Chandrika Kamath
October 7, 2008
Lawrence Livermore National Laboratory
Goal: solving the problem of data overload
• Use scientific data mining techniques to analyze
data from various SciDAC applications
• Techniques borrowed from image and video
processing, machine learning, statistics, pattern
recognition, …
• Leveraging the Sapphire scientific data mining
software, with functions added as required
• Contributors to the SciDAC part: Erick Cantú-Paz,
Imola K. Fodor, Siddharth Manay, Nicole S. Love
Overview of Sapphire
Sapphire: scientific data mining (1998-2008)
• We analyze science data from experiments,
observations, and simulations: massive *and* complex
• Sapphire has a three-fold focus
  • research in robust, accurate, scalable algorithms
  • modular, extensible software
  • analysis of data from practical problems
• Funded through DOE NNSA, LLNL LDRD, SDM SciDAC
Center, GSEP SciDAC project
https://computation.llnl.gov/casc/sapphire
Scientific data mining - from a Terabyte to a Megabyte
Raw Data → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge
• Data preprocessing: data fusion, sampling, multi-resolution analysis, de-noising, object identification, feature extraction, normalization, dimension reduction
• Pattern recognition: classification, clustering, regression
• Interpreting results: visualization, validation
An iterative and interactive process
The Sapphire system architecture: flexible, portable, scalable
• FITS, BSQ, PNM, View, . . .
• De-noise data, background subtraction, identify objects, extract features, sample data, fuse data, multi-resolution analysis
• RDB data store: data items, features
• Normalization, dimension reduction
• Decision trees, neural networks, SVMs, k-nearest neighbors, clustering, evolutionary algorithms, tracking, ….
• Display patterns, with user input & feedback
• Diagram legend: Sapphire software, public domain software, Sapphire & domain software
• Components linked by Python
US Patents 6675164 (1/04), 6859804 (2/05), 6879729 (4/05), 6938049 (8/05), 7007035 (2/06), 7062504 (6/06)
The modular software is used to meet the needs of different applications
• Applications: remote sensing, astronomy, climate simulations, video surveillance, plasma physics, fragmentation of materials, fluid mix and turbulence, sim/expt comparison
• Sapphire software: Sapphire libraries (scientific data processing, dimension reduction, pattern recognition), with drivers and support functions
• Graphical interface, command-line interface, …
In this talk, I focus only on SciDAC applications
SciDAC achievements
Application 1: Separating signals in climate data
• We used independent component analysis to separate
El Niño and volcano signals in climate simulations
• Showed that the technique can be used to enable
better comparisons of simulations
Collaboration with Ben Santer (LLNL)
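As a rough sketch of the idea, in standard textbook notation that is not taken from the slides: independent component analysis treats the observed series as an unknown linear mixture of statistically independent sources and estimates an unmixing matrix.

```latex
\begin{align}
  \mathbf{x}(t) &= A\,\mathbf{s}(t), \\
  \hat{\mathbf{s}}(t) &= W\,\mathbf{x}(t),
  \qquad W \approx A^{-1} \ \text{(up to permutation and scaling of the rows)}
\end{align}
```

Under this reading, x(t) would hold the simulated climate series, s(t) the underlying sources such as the El Niño and volcano signals, and W is chosen so that the components of the estimate are as statistically independent as possible. The number of series and the estimation algorithm used are not stated in the slides.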
Application 2: Identifying key features for EHOs in DIII-D
• We used dimension reduction techniques from
statistics and machine learning to identify key
features associated with edge harmonic
oscillations in the DIII-D tokamak
• H-mode is the preferred mode of operation, but it is
associated with edge-localized modes (ELMs), which can
damage components of the tokamak
• A quiescent H-mode has been observed and is associated
with EHOs, so we need to understand EHOs better
• The key variables identified are being used to
understand the cause of EHOs; the software has
been licensed to GAT
Collaboration with Keith Burrell and Mike Walker (GAT)
The data is from sensors in DIII-D
• 700 experiments, each lasting 6 seconds
• Each 50 ms window of an experiment is assigned a low
or high EHO-ness label
• Each window is described by 37 sensor measurements
• Data cleanup
  • discard windows with at least one missing sensor value
  • use median value of variable in window
  • discard windows with at least one variable in the top or bottom
    percentile of its range
  • resulted in 41818 instances
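If the 50 ms windows do not overlap (an assumption; the slide does not say), 700 experiments of 6 seconds each give 700 × 120 = 84,000 candidate windows, of which 41818 survive the cleanup. The cleanup steps could be sketched roughly as follows. The data layout, the function names, and the reading of "top or bottom percentile of its range" as 1% of the min-to-max range are all assumptions made for illustration, not the Sapphire implementation.

```python
from statistics import median

def window_medians(window):
    """Collapse one window (a list of per-sensor sample lists) to per-sensor medians.

    Returns None when any sample is missing, so the window can be discarded.
    """
    if any(sample is None for sensor in window for sample in sensor):
        return None
    return [median(sensor) for sensor in window]

def trim_extremes(rows, pct=1.0):
    """Drop rows with any variable in the bottom or top `pct` percent of that
    variable's observed range (one possible reading of the slide's rule)."""
    n_vars = len(rows[0])
    bounds = []
    for j in range(n_vars):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        margin = (hi - lo) * pct / 100.0
        bounds.append((lo + margin, hi - margin))
    return [row for row in rows
            if all(bounds[j][0] <= row[j] <= bounds[j][1] for j in range(n_vars))]

def clean(windows):
    """Apply the three cleanup steps in the order listed on the slide."""
    rows = [m for m in map(window_medians, windows) if m is not None]
    return trim_extremes(rows) if rows else []

# Toy input: five windows with one sensor each (the real data has 37 sensors).
windows = [
    [[1.0, 2.0, 3.0]],
    [[None, 5.0]],         # missing value: discarded
    [[50.0, 50.0, 50.0]],
    [[100.0, 100.0]],      # top of the range: discarded
    [[0.0, 0.0]],          # bottom of the range: discarded
]
cleaned = clean(windows)
print(cleaned)
```

With this reading, the windows holding the minimum and maximum of any variable are always dropped; a percentile of the empirical distribution would behave differently.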
Challenge: no preconceived notion of which sensor values are important
• Data cleanup: prevents outliers from influencing results
• Use different feature selection methods to gain
confidence
  • PCA filter – use magnitude of coefficients
  • Distance filter – Kullback-Leibler distance between histograms
  • Stump filter
  • Chi-square filter
  • Boosting approach
• Introduce a “noise” feature
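To make the distance filter and the "noise" feature concrete, here is a minimal sketch on synthetic data. It assumes a symmetrized Kullback-Leibler divergence, ten equal-width bins, and add-one smoothing. None of these choices are stated in the slides, and all function names are hypothetical.

```python
import math
import random

def histogram(values, lo, hi, bins=10):
    """Normalized histogram over [lo, hi] with add-one smoothing (avoids log(0))."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values) + bins
    return [(c + 1) / total for c in counts]

def kl_distance(p, q):
    """Symmetrized Kullback-Leibler divergence between two histograms."""
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))

def rank_features(rows, labels, bins=10):
    """Score each feature by the distance between its low- and high-label histograms."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        lo, hi = min(column), max(column)
        low = histogram([v for v, y in zip(column, labels) if y == 0], lo, hi, bins)
        high = histogram([v for v, y in zip(column, labels) if y == 1], lo, hi, bins)
        scores.append((kl_distance(low, high), j))
    return sorted(scores, reverse=True)

# Synthetic data: feature 0 depends strongly on the label, feature 1 weakly,
# and feature 2 is the injected "noise" feature that ignores the label.
rng = random.Random(0)
labels = [i % 2 for i in range(400)]
rows = [[rng.gauss(3.0 * y, 1.0), rng.gauss(0.5 * y, 1.0), rng.gauss(0.0, 1.0)]
        for y in labels]

ranked = rank_features(rows, labels)
noise_score = next(score for score, j in ranked if j == 2)
above_noise = [j for score, j in ranked if score > noise_score]
print(ranked)
print("features scoring above the noise feature:", above_noise)
```

One plausible use of the noise feature, shown here, is as a baseline: a real feature that scores no better than pure noise is unlikely to be informative.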
We evaluated the features using a naïve Bayes classifier
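A selected feature subset can be scored by training a classifier on those features alone and measuring its error on held-out data. The sketch below uses a Gaussian naïve Bayes model on synthetic data. The Gaussian likelihood, the train/test split, and all names are assumptions for illustration, since the slide does not describe the classifier's details.

```python
import math
import random

def fit_gnb(rows, labels):
    """Per class: prior, per-feature mean, and per-feature variance."""
    model = {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[c] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the largest log posterior under independent Gaussians."""
    def log_posterior(c):
        prior, means, variances = model[c]
        total = math.log(prior)
        for v, m, s2 in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=log_posterior)

def error_rate(train, train_y, test, test_y, feature_idx):
    """Train on the chosen feature columns only and return the test error."""
    def pick(rows):
        return [[r[j] for j in feature_idx] for r in rows]
    model = fit_gnb(pick(train), train_y)
    wrong = sum(predict(model, r) != y for r, y in zip(pick(test), test_y))
    return wrong / len(test)

# Synthetic data: feature 0 carries the label, feature 1 is pure noise.
rng = random.Random(1)

def sample(n):
    ys = [i % 2 for i in range(n)]
    xs = [[rng.gauss(3.0 * y, 1.0), rng.gauss(0.0, 1.0)] for y in ys]
    return xs, ys

train_x, train_y = sample(400)
test_x, test_y = sample(200)
err_informative = error_rate(train_x, train_y, test_x, test_y, [0])
err_noise = error_rate(train_x, train_y, test_x, test_y, [1])
print(err_informative, err_noise)
```

Repeating the evaluation for the top k features of each ranking, with k increasing, would show how quickly each selection method reaches a low error.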
We also considered the top ten features selected by the methods
PCA filter  Distance  Chi-square  Stump  Boosting
F33         F27       F25         F25    F27
F32         F25       F27         F7     F25
F6          F9        F7          F27    F9
F23         F26       F29         F26    F14
F17         F30       F9          F30    F8
F12         F29       F26         F9     F26
F20         F7        F6          F19    F6
F21         F8        F8          F29    F22
F10         F14       F30         F31    F29
F11         F6        F19         F8     F23
Several features are common across different methods
• F6, F8, F9, F25, F26, F27, and F29 each appear in four of the five top-ten lists
Multiple methods provide confidence in results
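The overlap between methods can be tallied directly from the five top-ten lists in the table. In this sketch the feature numbers are copied from the table, while the threshold of four votes is an arbitrary choice made for illustration.

```python
from collections import Counter

# Top-ten features per method, in rank order, as listed in the table.
top_ten = {
    "PCA filter": [33, 32, 6, 23, 17, 12, 20, 21, 10, 11],
    "Distance":   [27, 25, 9, 26, 30, 29, 7, 8, 14, 6],
    "Chi-square": [25, 27, 7, 29, 9, 26, 6, 8, 30, 19],
    "Stump":      [25, 7, 27, 26, 30, 9, 19, 29, 31, 8],
    "Boosting":   [27, 25, 9, 14, 8, 26, 6, 22, 29, 23],
}

# Count how many methods placed each feature in their top ten.
votes = Counter(f for features in top_ten.values() for f in features)
consensus = sorted(f for f, n in votes.items() if n >= 4)
print(["F%d" % f for f in consensus])
```

No feature is chosen by all five methods; the PCA filter shares only F6 and F23 with the other lists.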
Application 3: Classifying and characterizing orbits in Poincaré plots
• I am using techniques from scientific data mining
to assign one of four labels to an orbit and
extract characteristics of separatrix and island
chain orbits.
Collaboration with J. Breslau, N. Pomphrey, D. Monticello (PPPL), S. Klasky (ORNL)
There are four classes of orbits – based on the location of the initial point
Figure panels: quasi-periodic, island chain, separatrix, stochastic
Challenge: There is a large variation in the orbits of any one class
Figure: quasi-periodic orbits
Variation in island-chain orbits
Variation in separatrix orbits
Figure labels: 5000 points, 1000 points
How do we extract representative features for an orbit?
• Variation in the data makes it difficult to identify
good features and extract them in a robust way
• Issues with labels assigned to orbits
• Next steps: characterizing island chains and
separatrix orbits
• Identifying missing orbits
Application 4: Tracking blobs in fusion plasma
• We are using image and video processing techniques to
identify and track blobs in experimental data from NSTX
to validate and refine theories of edge turbulence
Figure: frames at t, t+1, and t+2, shown as the denoised original, after removal of background, and after detection of blobs
Collaboration with S. Zweben, R. Maqueda, and D. Stotler (PPPL)
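The stages named in the figure (background removal, blob detection, tracking across frames) can be illustrated with a deliberately simple sketch. It assumes a per-pixel temporal median as the background, a fixed threshold with 4-connected components for detection, and nearest-centroid linking for tracking. These are generic textbook choices, not the methods used in the NSTX analysis, and the denoising step is omitted.

```python
from statistics import median

def remove_background(frames):
    """Subtract the per-pixel temporal median (assumed background model)."""
    h, w = len(frames[0]), len(frames[0][0])
    bg = [[median(f[y][x] for f in frames) for x in range(w)] for y in range(h)]
    return [[[max(f[y][x] - bg[y][x], 0) for x in range(w)] for y in range(h)]
            for f in frames]

def detect_blobs(frame, threshold):
    """Return centroids (row, col) of 4-connected regions above the threshold."""
    h, w = len(frame), len(frame[0])
    seen = [[False] * w for _ in range(h)]
    centroids = []
    for y in range(h):
        for x in range(w):
            if frame[y][x] <= threshold or seen[y][x]:
                continue
            stack, pixels = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and frame[ny][nx] > threshold and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            centroids.append((sum(p[0] for p in pixels) / len(pixels),
                              sum(p[1] for p in pixels) / len(pixels)))
    return centroids

def nearest(previous, candidates):
    """Link a blob to the closest centroid in the next frame."""
    return min(candidates,
               key=lambda c: (c[0] - previous[0]) ** 2 + (c[1] - previous[1]) ** 2)

def make_frame(blob_x, size=5, background=10, peak=50):
    """Toy frame: flat background with one bright pixel in row 1."""
    frame = [[background] * size for _ in range(size)]
    frame[1][blob_x] = peak
    return frame

frames = [make_frame(x) for x in (1, 2, 3)]   # blob moves one pixel per frame
track = []
for frame in remove_background(frames):
    blobs = detect_blobs(frame, threshold=20)
    if blobs:
        track.append(blobs[0] if not track else nearest(track[-1], blobs))
print(track)
```

Real GPI frames are noisy and the blobs change shape, so each of these stages would need a more robust method and careful parameter selection.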
Goal: understand the turbulence which causes leakage of the plasma
• Requirements for fusion – high temperature and
confined plasma
• Fine-scale turbulence at the edge causes leakage of
plasma from the center to the edge
• Loss of confinement
• Heat loss of plasma
• Erosion or vaporization of the containment wall
The Gas-Puff Imaging diagnostic is used to view the coherent structures
• Turbulence in the form of density filaments highly
elongated in the direction of the magnetic field
• Inject a gas cloud in the torus, and capture the
intersection of the cloud with the filament using a
camera which views the filament along the magnetic
field
Figure: GPI view, 16 x 32 cm
Data from GPI in NSTX
• A PSI-5 camera captures GPI images
• 300-frame sequences taken at 250,000 frames/sec
• 16-bit images with 64x64 pixels
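These figures imply that each sequence is short and small. A quick check of the arithmetic, assuming uncompressed storage at 2 bytes per pixel:

```python
frames_per_sequence = 300
frames_per_second = 250_000
bits_per_pixel = 16
width = height = 64

# Time covered by one sequence, in milliseconds.
duration_ms = frames_per_sequence / frames_per_second * 1000

# Raw storage, assuming no compression and no per-frame header.
bytes_per_frame = width * height * bits_per_pixel // 8
bytes_per_sequence = frames_per_sequence * bytes_per_frame

print(duration_ms, bytes_per_frame, bytes_per_sequence)
```

Each sequence therefore covers about 1.2 ms of plasma evolution and occupies roughly 2.46 MB of raw pixel data.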
Why is this difficult?
• coherent structures are poorly understood
empirically and not understood theoretically
• no known ground-truth
• noisy images
• variation within a sequence
Example frames to segment (sequence 113734: frames 1-50)
We are investigating several image segmentation methods
• Immersion-Based: basic immersion, constrained
watershed, watershed merging
• Region Growing: seeded region growing, seed
competition
• Model-Based: 2-D Gaussian fit
• Challenges: how do we select the parameters in
an algorithm, how do we handle the variability in
the data especially for longer sequences, how do
the choices of algorithms and parameters
influence the “science”, …
Ongoing work: see AHM 2007 slides
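As one illustration of the model-based option, the parameters of a single 2-D Gaussian blob can be estimated from intensity-weighted moments. This is a stand-in for a proper least-squares fit, named plainly as such. It assumes the background has already been removed and that the frame contains exactly one axis-aligned blob; the function names are hypothetical.

```python
import math

def gaussian_moments(frame):
    """Estimate the center and widths of one blob from intensity-weighted moments."""
    total = sum(sum(row) for row in frame)
    cx = sum(x * v for row in frame for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(frame) for v in row) / total
    var_x = sum((x - cx) ** 2 * v for row in frame for x, v in enumerate(row)) / total
    var_y = sum((y - cy) ** 2 * v for y, row in enumerate(frame) for v in row) / total
    return cx, cy, math.sqrt(var_x), math.sqrt(var_y)

def synthetic_blob(size, cx, cy, sx, sy):
    """Noise-free, axis-aligned Gaussian blob on a size x size grid."""
    return [[math.exp(-((x - cx) ** 2) / (2 * sx ** 2)
                      - ((y - cy) ** 2) / (2 * sy ** 2))
             for x in range(size)] for y in range(size)]

frame = synthetic_blob(size=21, cx=10, cy=10, sx=2.0, sy=1.5)
est_cx, est_cy, est_sx, est_sy = gaussian_moments(frame)
print(est_cx, est_cy, est_sx, est_sy)
```

Moments are sensitive to residual background and to neighboring blobs, which is one way the parameter and algorithm choices listed above can influence the extracted "science".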
Vision for the future
• Meeting algorithm requirements of current
applications
  • Robust extraction of feature vectors (orbit characterization)
  • Improved algorithms for image analysis (blob characterization)
  • Uncertainty quantification (how much can we trust the result?)
• Meeting the science goals
  • Classification and characterization of Poincaré plots
  • Tracking the blobs in NSTX
  • Extraction of coherent structures in fluid and particle data and
    their non-linear interactions (GSEP)
• Addressing requests from new applications – SNS,
materials science, combustion, power grid, …
• Deploy as requested