SDO Progress Presentation
Agenda
Benchmark dataset
– Acquisition
– Future additions
– Class balancing and problems
Image Processing
– Image parameters to extract
– Segmentation
– Extraction times
– Future parameters to add
Unsupervised Attribute Evaluation
– Correlation Maps
– Multi-Dimensional Scaling Maps
Classifiers
Benchmark Dataset: Creation
Searched the Heliophysics Event Knowledgebase (HEK) for solar events from 1999-11-03 to 2008-11-04
Extracted 1,600 Images from the TRACE Mission Image repository
Benchmark Dataset: Class balancing
We selected 8 different phenomena (or classes) and retrieved 200 random images for each phenomenon
For phenomena with 200+ reported events, we selected one random image from each event
For phenomena with fewer than 200 reported events, we selected multiple images from the same event, taken at different times
We selected images in the 171 and 1600 wavelengths due to their similarity (all from the TRACE mission)
We re-sized all images to 1024x1024 pixels
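The selection rules above can be sketched as follows. Here `events` is a hypothetical mapping from each HEK event ID to its available image IDs; the names and structure are illustrative, not from the actual retrieval pipeline:

```python
import random

def select_images(events, target=200, seed=0):
    """Pick `target` images for one phenomenon class.

    `events` maps event id -> list of candidate image ids.
    With at least `target` reported events, take one random image per
    event; otherwise draw additional images (taken at different times)
    from the same events until the quota is met.
    """
    rng = random.Random(seed)
    if len(events) >= target:
        # One random image from each of `target` randomly chosen events.
        return [rng.choice(events[eid])
                for eid in rng.sample(sorted(events), target)]
    # Too few events: sample from the pooled images of all events.
    pool = [img for imgs in events.values() for img in imgs]
    return rng.sample(pool, min(target, len(pool)))
```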
Benchmark Dataset: Problems
The HEK reported 14 different types of events (at the time of benchmark generation)
– For several phenomena there are few or no reported events
– Different phenomena are observed in different wavelengths
– Image resolutions differ within the same phenomenon
– Several phenomena are missing (Sigmoids, Polarity Inversion Line Mapping, Bright Points); so far we have received a response from Manolis K. Georgoulis regarding images containing sigmoids
Benchmark Dataset: Future Additions
Add the missing 6 classes of events if we can find suitable images
Add more images to our benchmark dataset
Create different ‘benchmarks’ for other wavelengths, since images between some wavelengths are very different
Benchmark Dataset
Event Name           # of Images  Wavelength  Resolution (count)             HEK reported events
Active Region        200          1600        768x768 (200)                  200+
Coronal Jet          200          171         1024x1024 (155), 768x768 (45)  25
Emerging Flux        200          1600        768x768 (200)                  12
Filament             200          171         1024x1024 (127), 768x768 (73)  45
Filament Activation  200          171         1024x1024 (200)                27
Filament Eruption    200          171         1024x1024 (49), 768x768 (151)  53
Flare                200          171         1024x1024 (64), 768x768 (136)  200+
Oscillation          200          171         1024x1024 (50), 768x768 (150)  3
Image Processing
Our goals in selecting image processing techniques and parameters are:
– Fast extraction of parameters from images
– Clear discrimination between different phenomena (not an easy task)
Image Processing: Parameters to Extract
So far we have extracted the following textural image parameters:
Entropy
Mean
Standard Deviation
3rd Moment (skewness)
4th Moment (kurtosis)
Relative Smoothness
Fractal Dimension
Tamura Contrast
Tamura Directionality
Histogram R
Histogram J
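Most of the first-order measures in this list can be computed directly from a block's pixel values and intensity histogram. The sketch below uses the standard textbook formulas with numpy; it is an illustration, not the project's actual extraction code, and it omits Fractal Dimension and the Tamura and Histogram parameters, which require more involved algorithms. Uniformity is included because it appears later in the correlation analysis:

```python
import numpy as np

def texture_params(block, bins=256):
    """Standard first-order texture measures for one image block."""
    # Normalized intensity histogram of the block.
    p = np.histogram(block, bins=bins, range=(0, bins))[0].astype(float)
    p /= p.sum()
    nz = p[p > 0]  # drop empty bins so log2 is defined
    mean = float(block.mean())
    std = float(block.std())
    return {
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "mean": mean,
        "std": std,
        # 3rd and 4th standardized moments (0 for a constant block).
        "skewness": float(((block - mean) ** 3).mean() / std**3) if std else 0.0,
        "kurtosis": float(((block - mean) ** 4).mean() / std**4) if std else 0.0,
        "relative_smoothness": 1.0 - 1.0 / (1.0 + std**2),
        "uniformity": float((p**2).sum()),
    }
```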
Image Processing: Segmentation
Segmenting breaks each image into smaller pieces, which lets us analyze parts of the image and decide which parts are interesting.
We used grid segmentation to split the images into 128 by 128 pixel blocks and extracted all our parameters from each block
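A 1024x1024 image with 128-pixel blocks yields an 8 x 8 grid of 64 tiles. A minimal numpy sketch of this grid segmentation (illustrative, not the project's actual code):

```python
import numpy as np

def grid_segments(image, block=128):
    """Split an image into non-overlapping block x block tiles,
    scanning row by row, left to right."""
    h, w = image.shape
    return [image[r:r + block, c:c + block]
            for r in range(0, h, block)
            for c in range(0, w, block)]
```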
Image Processing: Segmentation Example
Image Processing: Extraction Time
[Chart: total extraction time in seconds (0–450 s) for all 1,600 images at 4 x 4, 8 x 8, and 16 x 16 grid sizes, broken down by parameter: Tamura Directionality, Tamura Contrast, Histogram R, Histogram J, Kurtosis, Fractal Dimension, Entropy, Standard Deviation, Skewness, RS (Relative Smoothness), Uniformity, Mean]
Image Processing: Future Parameters to Add
Feature                                              Type
Asymmetry index                                      Asymmetry
Lengthening index                                    Asymmetry
Extent                                               Asymmetry
Middle divergence                                    Asymmetry
Compactness (irregularity) index, circularity, CIRC  Border irregularity
Regularity of contour                                Border irregularity
Edge abruptness, sharpness                           Border irregularity
Pigmentation transition                              Border irregularity
Greatest diameter                                    Others
Thinness ratio                                       Others
Lesion area and perimeter                            Others
Minimum diameter                                     Others
Transition area, transition region imbalance         Others
Background region imbalance                          Others
Peripheral dark.                                     Others
Roundness                                            Others
Fullness ratio                                       Others
Unsupervised Attribute Evaluation - Preliminary Results -
Why Correlation Analysis?
If two parameters are highly correlated, we can keep just one of them when tuning our classifiers.
Removing correlated parameters shortens processing time (especially important since we plan to run them in SDO's pipeline) and reduces future storage (the less storage used, the faster image retrieval will perform)
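A minimal numpy sketch of this correlation screening; the threshold and parameter names are illustrative, not values from the study:

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Flag parameter pairs whose absolute Pearson correlation
    exceeds `threshold`; one member of each pair is a candidate
    for removal. X is (n_samples, n_params)."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```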
Unsupervised Attribute Evaluation: Correlation Maps Within Same Class
We can certainly reduce dimensionality in two cases: 3rd Moment and 4th Moment, which are strongly correlated in the majority of our correlation maps, and likewise Uniformity and RS. Other parameters that appeared strongly correlated in our experiments (e.g., Tamura Contrast and Standard Deviation) need further analysis, since their correlation is inconsistent across phenomenon types.
We also identified a few correlations that occur only within certain types of phenomena (e.g., Filament, Filament Activation, and Filament Eruption). In the future this could let us discern these phenomena from the others and focus our parameter analysis on distinguishing among that reduced set of classes.
Unsupervised Attribute Evaluation: Multi-Dimensional Scaling for the previously shown correlation maps
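A minimal sketch of how such an MDS map of the parameters can be produced, assuming scikit-learn and using 1 − |correlation| as the dissimilarity; the presentation does not specify its MDS setup, so this is only one plausible construction:

```python
import numpy as np
from sklearn.manifold import MDS

def mds_from_correlation(corr, seed=0):
    """Project parameters to 2-D so that strongly correlated
    parameters land close together; distance = 1 - |correlation|."""
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)  # a parameter is at distance 0 from itself
    mds = MDS(n_components=2, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(dist)
```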
Classifiers
Based on our correlation analysis, we selected 3 different scenarios for our classification experiments (based on the Active Region class):
1) The original 10 image parameters
2) Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
3) Removing 3 parameters that are correlated with other parameters: Standard Deviation, Uniformity, and Tamura Contrast
Classifiers
We ran all 8 classes plus Empty Sun as a ninth class
We used the NaiveBayes classifier with 10-fold cross-validation
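This setup (a NaiveBayes classifier under 10-fold cross-validation) can be approximated in scikit-learn with GaussianNB; the synthetic 9-class, 10-feature dataset below is only a stand-in for the real image-parameter data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data: 9 classes (8 phenomena + Empty Sun), 10 parameters.
X, y = make_classification(n_samples=900, n_features=10,
                           n_informative=8, n_classes=9,
                           n_clusters_per_class=1, random_state=0)

# 10-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
```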
Classifiers: Evaluation Measures
There are many different measures used to evaluate the accuracy of classifiers. Many popular ones use the confusion matrix as a source of information about the classifier.
Commonly used measures are Precision, Recall, and F-measure, which can be defined as (using the counts TP, FP, and FN from the confusion matrix):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2·TP / (2·TP + FN + FP)
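The three definitions translate directly into code:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equivalent to the harmonic mean of precision and recall.
    f_measure = 2 * tp / (2 * tp + fn + fp)
    return precision, recall, f_measure
```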
Classifiers: Evaluation Measures
Other commonly used measures are ROC curves (Receiver Operating Characteristic), which characterize the trade-off between true-positive hits and false alarms.
The Area Under the ROC Curve (ROC area) is a measure of the accuracy of the model:
– a predictor with perfect accuracy has an area of 1.0
– the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
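The ROC area can be computed directly from its probabilistic interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small sketch (for illustration; evaluation tools compute this from the ranked scores more efficiently):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly
    chosen positive is scored above a randomly chosen negative
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```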
Preliminary Results - Classifiers: Naïve Bayes
Naïve Bayes is a linear classifier
A linear classifier groups items with similar feature values by making its classification decision based on the value of a linear combination of the features.
Preliminary Results - Classifiers: Naïve Bayes -
Using all 10 features:
Correctly Classified Instances    4421    31.6509 %
Incorrectly Classified Instances  9547    68.3491 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.838    0.194    0.351      0.838   0.494      0.898     ActiveRegion
0.152    0.033    0.368      0.152   0.215      0.775     CoronalJet
0.773    0.112    0.463      0.773   0.579      0.929     EmergingFlux
0.077    0.056    0.148      0.077   0.102      0.686     Empty
0.067    0.007    0.533      0.067   0.119      0.692     Filament
0.357    0.073    0.38       0.357   0.368      0.808     FilamentActivation
0.463    0.215    0.212      0.463   0.291      0.747     FilamentExplosion
0.002    0.01     0.024      0.002   0.004      0.62      Flare
0.12     0.07     0.177      0.12    0.143      0.692     Oscillation
Preliminary Results - Classifiers: Naïve Bayes -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    3993    28.5868 %
Incorrectly Classified Instances  9975    71.4132 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.815    0.282    0.265      0.815   0.4        0.846     ActiveRegion
0.073    0.022    0.293      0.073   0.117      0.674     CoronalJet
0.8      0.142    0.413      0.8     0.545      0.924     EmergingFlux
0.001    0.007    0.011      0.001   0.001      0.577     Empty
0.001    0        0.143      0.001   0.001      0.605     Filament
0.377    0.085    0.357      0.377   0.367      0.778     FilamentActivation
0.481    0.233    0.205      0.481   0.287      0.725     FilamentExplosion
0.003    0.01     0.039      0.003   0.006      0.557     Flare
0.022    0.021    0.116      0.022   0.037      0.664     Oscillation
Conclusion: By removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 31% to 28%).
Preliminary Results - Classifiers: Naïve Bayes -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    4641    33.2259 %
Incorrectly Classified Instances  9327    66.7741 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.808    0.156    0.393      0.808   0.529      0.902     ActiveRegion
0.22     0.056    0.33       0.22    0.264      0.776     CoronalJet
0.811    0.137    0.426      0.811   0.558      0.93      EmergingFlux
0.073    0.064    0.126      0.073   0.093      0.671     Empty
0.067    0.01     0.452      0.067   0.117      0.692     Filament
0.391    0.076    0.392      0.391   0.392      0.818     FilamentActivation
0.512    0.218    0.227      0.512   0.314      0.757     FilamentExplosion
0.059    0.019    0.282      0.059   0.098      0.643     Flare
0.048    0.016    0.276      0.048   0.081      0.696     Oscillation
Conclusion: Here, by removing 3 of the 10 parameters that were highly correlated with other parameters, we reduce storage and processing time while maintaining comparable accuracy (33% vs. 31% originally).
Preliminary Results - Classifiers: C4.5
Based on the dimensionality and distribution of our values from our Image Parameters, we decided to investigate the results of a Decision Tree classifier, as the next step after a linear classifier.
A Decision Tree classifier has the goal of creating a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.
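The experiments used C4.5; scikit-learn's DecisionTreeClassifier implements the closely related CART algorithm and, with the entropy criterion, serves as an analogous sketch. The iris dataset below is only a stand-in for the image-parameter data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Entropy-based splits, as in C4.5; evaluated with 10-fold CV.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_val_score(tree, X, y, cv=10)
```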
Using all 10 features:
Correctly Classified Instances    9163    65.5999 %
Incorrectly Classified Instances  4805    34.4001 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.757    0.044    0.681      0.757   0.717      0.902     ActiveRegion
0.636    0.055    0.589      0.636   0.612      0.831     CoronalJet
0.799    0.023    0.812      0.799   0.805      0.919     EmergingFlux
0.668    0.032    0.725      0.668   0.696      0.908     Empty
0.624    0.048    0.62       0.624   0.622      0.816     Filament
0.765    0.033    0.746      0.765   0.755      0.902     FilamentActivation
0.484    0.056    0.518      0.484   0.5        0.785     FilamentExplosion
0.534    0.051    0.567      0.534   0.55       0.796     Flare
0.637    0.045    0.641      0.637   0.639      0.845     Oscillation
Preliminary Results - Classifiers: C4.5 -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    8278    59.264 %
Incorrectly Classified Instances  5690    40.736 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.653    0.067    0.549      0.653   0.596      0.841     ActiveRegion
0.587    0.061    0.546      0.587   0.566      0.821     CoronalJet
0.778    0.031    0.759      0.778   0.769      0.908     EmergingFlux
0.573    0.041    0.634      0.573   0.602      0.861     Empty
0.564    0.055    0.56       0.564   0.562      0.791     Filament
0.772    0.035    0.731      0.772   0.751      0.908     FilamentActivation
0.414    0.063    0.452      0.414   0.432      0.747     FilamentExplosion
0.424    0.058    0.478      0.424   0.449      0.732     Flare
0.568    0.046    0.605      0.568   0.586      0.83      Oscillation
Conclusion: Again, by removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 65% to 59%).
Preliminary Results - Classifiers: C4.5 -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    8876    63.5452 %
Incorrectly Classified Instances  5092    36.4548 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.716    0.04     0.692      0.716   0.704      0.898     ActiveRegion
0.586    0.056    0.567      0.586   0.576      0.822     CoronalJet
0.791    0.029    0.773      0.791   0.782      0.914     EmergingFlux
0.678    0.039    0.687      0.678   0.682      0.905     Empty
0.61     0.053    0.589      0.61    0.599      0.811     Filament
0.767    0.034    0.737      0.767   0.752      0.9       FilamentActivation
0.448    0.057    0.498      0.448   0.472      0.776     FilamentExplosion
0.506    0.055    0.537      0.506   0.521      0.778     Flare
0.615    0.048    0.616      0.615   0.616      0.838     Oscillation
Conclusion: Once again, by removing 3 parameters that were highly correlated with other parameters, we maintained comparable accuracy (63% vs. 65% originally, a minimal decrease).
Preliminary Results - Classifiers: Adaboost C4.5
Looking for better results? Boosting may help.
AdaBoost (short for Adaptive Boosting) is a machine learning algorithm that is adaptive in the sense that each subsequent classifier is tweaked in favor of the instances misclassified by the previous classifiers
We decided to use this boosting algorithm to improve our classification results
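A hedged sketch of the same idea with scikit-learn's AdaBoostClassifier, which by default boosts shallow decision stumps rather than full C4.5 trees; the synthetic data is a stand-in for the image parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Stand-in multi-class data.
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=6, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Each round reweights the training set toward previously
# misclassified instances before fitting the next weak learner.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(boosted, X, y, cv=10)
```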
Using all 10 features:
Correctly Classified Instances    10114   72.4084 %
Incorrectly Classified Instances  3854    27.5916 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.781    0.032    0.755      0.781   0.768      0.964     ActiveRegion
0.715    0.042    0.681      0.715   0.697      0.943     CoronalJet
0.838    0.017    0.861      0.838   0.849      0.968     EmergingFlux
0.756    0.031    0.752      0.756   0.754      0.947     Empty
0.698    0.037    0.704      0.698   0.701      0.942     Filament
0.847    0.026    0.801      0.847   0.823      0.971     FilamentActivation
0.546    0.044    0.607      0.546   0.575      0.899     FilamentExplosion
0.617    0.04     0.656      0.617   0.636      0.911     Flare
0.719    0.041    0.685      0.719   0.701      0.945     Oscillation
Preliminary Results - Classifiers: Adaboost C4.5 -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    8920    63.8603 %
Incorrectly Classified Instances  5048    36.1397 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.66     0.058    0.589      0.66    0.622      0.923     ActiveRegion
0.646    0.049    0.622      0.646   0.634      0.909     CoronalJet
0.811    0.025    0.802      0.811   0.807      0.952     EmergingFlux
0.653    0.043    0.656      0.653   0.654      0.912     Empty
0.591    0.043    0.634      0.591   0.612      0.896     Filament
0.795    0.033    0.752      0.795   0.773      0.959     FilamentActivation
0.461    0.059    0.495      0.461   0.477      0.843     FilamentExplosion
0.486    0.053    0.533      0.486   0.509      0.844     Flare
0.644    0.045    0.644      0.644   0.644      0.913     Oscillation
Conclusion: And again, by removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 72% to 63%).
Preliminary Results - Classifiers: Adaboost C4.5 -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    9706    69.4874 %
Incorrectly Classified Instances  4262    30.5126 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.776    0.033    0.747      0.776   0.761      0.956     ActiveRegion
0.667    0.046    0.644      0.667   0.655      0.927     CoronalJet
0.826    0.018    0.851      0.826   0.838      0.965     EmergingFlux
0.737    0.031    0.748      0.737   0.742      0.94      Empty
0.663    0.04     0.672      0.663   0.668      0.925     Filament
0.809    0.029    0.776      0.809   0.792      0.962     FilamentActivation
0.51     0.054    0.544      0.51    0.526      0.874     FilamentExplosion
0.59     0.049    0.6        0.59    0.595      0.899     Flare
0.676    0.043    0.664      0.676   0.67       0.931     Oscillation
Conclusion: Finally, by removing 3 of the 10 parameters that were highly correlated with other parameters, we reduce storage and processing time while maintaining comparable accuracy (69% vs. 72% originally; a 3% decrease in accuracy against a 30% reduction in the number of parameters seems like a good deal).
Preliminary Conclusions
Based on these preliminary classification results, we can drop the image parameters Standard Deviation, Uniformity, and Tamura Contrast to reduce storage space and computational costs, since we achieve classification percentages similar to those obtained with the complete set of parameters.