Upload
pascal
View
20
Download
0
Tags:
Embed Size (px)
DESCRIPTION
An Interoperable Framework for Mining and Analysis of Space Science Data (F-MASS). PI: Sara J. Graves Project Lead: Rahul Ramachandran Information Technology and Systems Center University of Alabama in Huntsville [email protected] [email protected]. http://www.itsc.uah.edu. - PowerPoint PPT Presentation
Citation preview
An Interoperable Framework for An Interoperable Framework for Mining and Analysis of Space Mining and Analysis of Space
Science Data (F-MASS)Science Data (F-MASS)
PI: Sara J. Graves PI: Sara J. Graves Project Lead: Rahul RamachandranProject Lead: Rahul Ramachandran
Information Technology and Systems CenterInformation Technology and Systems CenterUniversity of Alabama in HuntsvilleUniversity of Alabama in Huntsville
[email protected]@itsc.uah.edu [email protected]@itsc.uah.edu
http://www.itsc.uah.edu
Others Involved in the ProjectOthers Involved in the Project
Wladislaw Lyatsky and Arjun Tan (Co-PI)Wladislaw Lyatsky and Arjun Tan (Co-PI)
Department of Physics, Alabama A&M Department of Physics, Alabama A&M UniversityUniversity
Glynn Germany Glynn Germany
Center for Space Plasma, Aeronomy, and Center for Space Plasma, Aeronomy, and Astrophysics Research, University of Alabama in Astrophysics Research, University of Alabama in HuntsvilleHuntsville
Xiang Li, Matt He, John Rushing and Amy LinXiang Li, Matt He, John Rushing and Amy Lin
ITSC, University of Alabama in HuntsvilleITSC, University of Alabama in Huntsville
Extend the existing scientific data mining framework Extend the existing scientific data mining framework by providing additional data mining algorithms and by providing additional data mining algorithms and customized user interfaces appropriate for the customized user interfaces appropriate for the space science research domainspace science research domain Provide a framework for mining to allow better data Provide a framework for mining to allow better data
exploitation and use exploitation and use
Utilize specific space science research scenarios Utilize specific space science research scenarios as use case drivers for identifying additional as use case drivers for identifying additional techniques to be incorporated into the frameworktechniques to be incorporated into the framework Enable scientific discovery and analysisEnable scientific discovery and analysis
Project ObjectivesProject Objectives
Overview of the “New” Mining Overview of the “New” Mining Framework Framework
Case Study: Comparing Different Case Study: Comparing Different Thresholding Algorithms for Thresholding Algorithms for Segmenting Auroras Segmenting Auroras
Presentation OutlinePresentation Outline
ADaM 4.0ADaM 4.0 Algorithm Development and MiningAlgorithm Development and Mining
Mining System: Design Mining System: Design ObjectivesObjectives
Ease of Use!Ease of Use!
Reusable ComponentsReusable Components
Simple Internal Data ModelSimple Internal Data Model
Allow both loose and tight coupling Allow both loose and tight coupling
with other applications/systemswith other applications/systems
Flexible to allow ease of use in both Flexible to allow ease of use in both
batch and interactive modebatch and interactive mode
Mining System DesignMining System Design
VIRTUAL REPOSITORY OF OPERATIONS
DATA MINING IMAGE PROCESSING
TOOLKIT TOOLKIT
OPERATIONS
PROVIDE MINING OPERATIONS AS WEB SERVICES
BUILD GENERIC APPLICATIONS
USE OPERATIONS AS STAND ALONE
EXECUTABLES
BUILD CUSTOMIZED APPLICATIONS
Each component is provided with a Each component is provided with a C++ application programming C++ application programming interface (API), an executable in interface (API), an executable in support of scripting tools (e.g. Perl, support of scripting tools (e.g. Perl, Python, Tcl, Shell) Python, Tcl, Shell)
ADaM components are lightweight ADaM components are lightweight and autonomous, and have been and autonomous, and have been used successfully in a grid used successfully in a grid environmentenvironment
ADaM has several translation ADaM has several translation components that provide data level components that provide data level interoperability with other mining interoperability with other mining systems (such as WEKA and systems (such as WEKA and Orange), and point tools (such as Orange), and point tools (such as libSVM and svmLight)libSVM and svmLight)
ADaM also includes Python wrappersADaM also includes Python wrappers
ADaM toolkit is available to all ADaM toolkit is available to all
ADaM 4.0 ComponentsADaM 4.0 Components
ADaM Example: Classification ProcessADaM Example: Classification Process
Identify potential features which may characterize Identify potential features which may characterize the phenomenon of interestthe phenomenon of interestGenerate a set of training instances where each Generate a set of training instances where each instance consists of a set of feature values and the instance consists of a set of feature values and the corresponding class labelcorresponding class labelDescribe the instances using ARFF file formatDescribe the instances using ARFF file formatPreprocess the data as necessary (normalize, Preprocess the data as necessary (normalize, sample etc.)sample etc.)Split the data into training / test set(s) as Split the data into training / test set(s) as appropriateappropriateTrain the classifier using the training setTrain the classifier using the training setEvaluate classifier performance using test setEvaluate classifier performance using test set
Sample Data Set – ARFF FormatSample Data Set – ARFF Format
Utilities for Splitting the SamplesUtilities for Splitting the SamplesADaM has utilities for splitting data sets into disjoint groups for training and testing ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiersclassifiers
The simplest is ITSC_Sample, which splits the source data set into two disjoint The simplest is ITSC_Sample, which splits the source data set into two disjoint subsetssubsetsExample: split data set into two groups, one with 2/3 of the patterns and another Example: split data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns:with 1/3 of the patterns:
ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff –p 0.66ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff –p 0.66
The –i argument specifies the input file nameThe –i argument specifies the input file name The –o and –t arguments specify the names of the two output files (-o = output one, -t = The –o and –t arguments specify the names of the two output files (-o = output one, -t =
output two)output two) The –p argument specifies the portion of data that goes into output one (trn.arff), the The –p argument specifies the portion of data that goes into output one (trn.arff), the
remainder goes to output two (tst.arff)remainder goes to output two (tst.arff) The –c argument tells the sample program which attribute is the class attributeThe –c argument tells the sample program which attribute is the class attribute
Training the ClassifierTraining the Classifier
ADaM has several different types of classifiersADaM has several different types of classifiersEach classifier has a training method and an application methodEach classifier has a training method and an application methodExample: Naïve Bayes classifierExample: Naïve Bayes classifier
ITSC_NaiveBayesTrain -c class -i trn.arff –b bayes.txtITSC_NaiveBayesTrain -c class -i trn.arff –b bayes.txt
The –i argument specifies the input file nameThe –i argument specifies the input file name The –c argument specifies the name of the class attributeThe –c argument specifies the name of the class attribute The –b argument specifies the name of the classifier file:The –b argument specifies the name of the classifier file:
Applying the ClassifierApplying the Classifier
Once trained, the Naïve Bayes classifier can be used to classify unknown instancesOnce trained, the Naïve Bayes classifier can be used to classify unknown instancesFor this demo, the classifier is run as follows:For this demo, the classifier is run as follows:
ITSC_NaiveBayesApply -c class -i tst.arff –b bayes.txt -o res_tst.arffITSC_NaiveBayesApply -c class -i tst.arff –b bayes.txt -o res_tst.arff The –i argument specifies the input file nameThe –i argument specifies the input file name The –c argument specifies the name of the class attributeThe –c argument specifies the name of the class attribute The –b argument specifies the name of the classifier fileThe –b argument specifies the name of the classifier file The –o argument specifies the name of the result fileThe –o argument specifies the name of the result file
Evaluating Classifier PerformanceEvaluating Classifier Performance
By applying the classifier to a test set where the correct class is known in By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual advance, it is possible to compare the expected output to the actual output.output.ITSC_Accuracy is run as follows:ITSC_Accuracy is run as follows:
ITSC_Accuracy -c class -t res_tst.arff –v tst.arff –o acc_tst.txtITSC_Accuracy -c class -t res_tst.arff –v tst.arff –o acc_tst.txt
Python Script for ClassificationPython Script for Classification
Additional InformationAdditional Information
http://datamining.itsc.uah.edu/adam/http://datamining.itsc.uah.edu/adam/ Additional informationAdditional information DocumentationDocumentation Download ADaM 4.0 executables (Windows Download ADaM 4.0 executables (Windows
and Linux)and Linux)
Comparing Different Thresholding Comparing Different Thresholding Algorithms for Segmenting Auroras Algorithms for Segmenting Auroras
Case Study using 130 images from Case Study using 130 images from UVI observations on September 14, UVI observations on September 14, 1997, covering the time period from 1997, covering the time period from
8:30 UT and 11:27 UT 8:30 UT and 11:27 UT
Segmenting Auroral EventsSegmenting Auroral Events
Spacecraft UV images observing auroral events contain Spacecraft UV images observing auroral events contain two regions, an auroral oval and the background two regions, an auroral oval and the background Under ideal circumstances, the histogram of these Under ideal circumstances, the histogram of these images has two distinct modes and a threshold value images has two distinct modes and a threshold value can be determined to separate the two regions can be determined to separate the two regions Different factors such as the date, time of the day, and Different factors such as the date, time of the day, and satellite position all affect the luminosity gradient of the satellite position all affect the luminosity gradient of the UV image making the two regions overlap and thereby UV image making the two regions overlap and thereby making the threshold selection a non trivial problemmaking the threshold selection a non trivial problem Objective of this study:Objective of this study:Compare different thresholding techniques and Compare different thresholding techniques and algorithms for segmenting auroral events in Polar UV algorithms for segmenting auroral events in Polar UV imagesimages
Thresholding Thresholding
Global ThresholdingGlobal Thresholding uses a fixed threshold for all pixels in the imageuses a fixed threshold for all pixels in the image works only if the intensity histogram of the input works only if the intensity histogram of the input
image contains neatly separated peaks corresponding image contains neatly separated peaks corresponding to the desired subject(s) and background(s)to the desired subject(s) and background(s)
cannot deal with images containing, for example, a cannot deal with images containing, for example, a strong illumination gradient. strong illumination gradient.
Adaptive/Local ThresholdingAdaptive/Local Thresholding selects an individual threshold for each pixel based on selects an individual threshold for each pixel based on
the range of intensity values in its local the range of intensity values in its local neighbourhoodneighbourhood
allows for thresholding of an image whose global allows for thresholding of an image whose global intensity histogram doesn't contain distinctive peaks intensity histogram doesn't contain distinctive peaks
Methodology Used in this StudyMethodology Used in this Study
Test global thresholding technique using the following Test global thresholding technique using the following algorithmsalgorithms
Mixture ModelingMixture Modeling Fuzzy SetFuzzy Set EntropyEntropy
Develop adaptive thresholding technique using context Develop adaptive thresholding technique using context information and the following algorithmsinformation and the following algorithms
Modified Mixture ModelingModified Mixture Modeling Fuzzy SetFuzzy Set EntropyEntropy Edge Based DetectionEdge Based Detection
Test and evaluateTest and evaluate
Mixture ModelingMixture Modeling
Mixture modeling thresholding algorithm assumes that Mixture modeling thresholding algorithm assumes that the object and the background are distributed normally the object and the background are distributed normally and the threshold is calculated by minimizing the error by and the threshold is calculated by minimizing the error by fitting the two Gaussian distributions to the histogram, fitting the two Gaussian distributions to the histogram, The Gaussian distributions are described by:The Gaussian distributions are described by:
The least square minimization function is given byThe least square minimization function is given by
)2/)(( 22
)(
x
eP
xG
v
i
ie
NP
0
)2/)(( 22
255
0
221 ))()()((
i
iFiGiGR
Fuzzy SetsFuzzy Sets
The fuzzy sets algorithm uses a membership function to The fuzzy sets algorithm uses a membership function to define a fuzzy object regiondefine a fuzzy object regionIn a normal set theory, a set has no elements in common In a normal set theory, a set has no elements in common between itself and its complement. between itself and its complement. In a fuzzy set, each element may belong to a set and its In a fuzzy set, each element may belong to a set and its complement with certain probabilities. complement with certain probabilities. Yager (1979) defined a measure of fuzziness for the Yager (1979) defined a measure of fuzziness for the degree with which a set and its compliment were degree with which a set and its compliment were indistinguishable. This is given by the following indistinguishable. This is given by the following expressionexpression
The gray level value that minimizes the fuzziness is the The gray level value that minimizes the fuzziness is the threshold value for the image.threshold value for the image.
p
i
p
xxp iuiutD
1
)()()(
EntropyEntropy
Entropy is a measure commonly used in information Entropy is a measure commonly used in information theory and characterizes the impurity of a data. The theory and characterizes the impurity of a data. The entropy of the object and the background for a given entropy of the object and the background for a given threshold threshold tt can be calculated by using: can be calculated by using:
and where and where pjpj is the histogram value at gray level j is the histogram value at gray level j
Thus, finding the threshold for the image now becomes Thus, finding the threshold for the image now becomes an optimization. an optimization. The gray level value that maximizes the entropy for the The gray level value that maximizes the entropy for the sum of H0 and Hw is used as the threshold.sum of H0 and Hw is used as the threshold.
t
jt
ii
j
t
ii
j
p
p
p
pH
0
00
0 log
255
1255
1
255
1
logtj
tii
j
tii
jw
p
p
p
pH
Edge Based DetectionEdge Based Detection
The edge based method identifies the aurora region by detecting the The edge based method identifies the aurora region by detecting the transition zone between aurora and backgroundtransition zone between aurora and backgroundThis is an adaptive technique that uses context information to subdivide the This is an adaptive technique that uses context information to subdivide the image into subzonesimage into subzonesUsing the domain knowledge that the auroras are centered on the Using the domain knowledge that the auroras are centered on the magnetic pole, radial slices at a certain Magnetic Local Time (MLT) and magnetic pole, radial slices at a certain Magnetic Local Time (MLT) and starting from the magnetic pole are used to divide the image into subzonesstarting from the magnetic pole are used to divide the image into subzonesThe rate of change of the intensity along the magnetic latitude is calculated The rate of change of the intensity along the magnetic latitude is calculated using the following formulausing the following formula
The transition zones show a sharp change in intensity between the aurora The transition zones show a sharp change in intensity between the aurora and backgroundand backgroundTherefore, by detecting the maximum and minimum gradient location, the Therefore, by detecting the maximum and minimum gradient location, the poleward and equatorward boundaries of aurora oval are identified for this poleward and equatorward boundaries of aurora oval are identified for this MLT MLT
),,(),()( MLTLatIMLTLatLatILat
IMLT
Global Thresholding Result: Global Thresholding Result: Sept, 14, 1997 image, 08:41:53 UTC Sept, 14, 1997 image, 08:41:53 UTC
ORIGINAL IMAGE IMAGE HISTOGRAM
MIXTURE MODELING (64) ENTROPY (122)FUZZY SETS (132)
Global Thresholding Result: Global Thresholding Result: Sept, 14, 1997 image,10:31:40 UTCSept, 14, 1997 image,10:31:40 UTC
ORIGINAL IMAGE IMAGE HISTOGRAM
MIXTURE MODELING (43) ENTROPY (130)FUZZY SETS (138)
Adaptive Thresholding ProcessAdaptive Thresholding Process
MLT(18) MIXTURE MODELING (91) FUZZY SETS(124) ENTROPY(117)
MLT(20) MIXTURE MODELING (52) FUZZY SETS(123) ENTROPY(162)
Adaptive Thresholding Process:Adaptive Thresholding Process:Edge Based DetectionEdge Based Detection
MLT (18)
MLT( 20)
Adaptive Thresholding Results:Adaptive Thresholding Results:Sept 14, 1997 image 09:05:48 UTC Sept 14, 1997 image 09:05:48 UTC A
B C
D E
A. Original Image B. Mixture Modeling C. Entropy D. Fuzzy Sets E. Gradient
Adaptive Thresholding Results:Adaptive Thresholding Results:Sept 14, 1997 image 08:31:27 UTC Sept 14, 1997 image 08:31:27 UTC
A. Original Image B. Mixture Modeling C. Entropy D. Fuzzy Sets E. Gradient
AB C
D E
AnalysisAnalysis
Global Thresholding Techniques:Global Thresholding Techniques: As expected, global thresholding techniques do not perform well As expected, global thresholding techniques do not perform well
in aurora detection in aurora detection Fuzzy sets and Entropy based thresholding algorithms in Fuzzy sets and Entropy based thresholding algorithms in
general tend to overestimate the threshold and are unable to general tend to overestimate the threshold and are unable to detect the entire aurora detect the entire aurora
Mixture modeling algorithm on the other hand underestimates Mixture modeling algorithm on the other hand underestimates the threshold and may label some portion of background as the threshold and may label some portion of background as auroraaurora
Adaptive Thresholding Techniques:Adaptive Thresholding Techniques: Adaptive thresholding techniques show better results for the Adaptive thresholding techniques show better results for the
given set of images as compared to global thresholdinggiven set of images as compared to global thresholding Fuzzy set, Entropy, Edge Based Detection algorithms all are Fuzzy set, Entropy, Edge Based Detection algorithms all are
unable to do as good a job as Mixture modelingunable to do as good a job as Mixture modeling
Future WorkFuture WorkInvestigate the behavior of these thresholding algorithms for oval Investigate the behavior of these thresholding algorithms for oval boundary detection in the presence of day glow and other types of boundary detection in the presence of day glow and other types of noise noise Investigate the use of Chow and Kaneko (1972) technique for Investigate the use of Chow and Kaneko (1972) technique for adaptive thresholding adaptive thresholding
Current adaptive thresholding scheme divides an image into subzones Current adaptive thresholding scheme divides an image into subzones based on the MLT time. based on the MLT time.
Disadvantage: Disadvantage: It requires auxiliary MLT information. It requires auxiliary MLT information. The numbers of pixels in each subzone are not uniform, and at times, improper The numbers of pixels in each subzone are not uniform, and at times, improper
threshold is obtainedthreshold is obtained Chow and Kaneko method partitions the image into non-overlapping Chow and Kaneko method partitions the image into non-overlapping
blocks of equal area and the threshold for each block computed blocks of equal area and the threshold for each block computed independentlyindependently
The thresholds for the blocks are then interpolated over the entire imageThe thresholds for the blocks are then interpolated over the entire image
The best thresholding algorithms will be incorporated to ADaM 4.0The best thresholding algorithms will be incorporated to ADaM 4.0