SDO Progress Presentation
Agenda
Benchmark dataset
– Acquisition
– Future additions
– Class balancing and problems
Image Processing
– Image parameters to extract
– Segmentation
– Extraction times
– Future parameters to add
Unsupervised Attribute Evaluation
– Correlation Maps
– Multi-Dimensional Scaling Maps
Classifiers
Benchmark Dataset: Creation
Searched the Heliophysics Event Knowledgebase (HEK) for solar events from 1999-11-03 to 2008-11-04
Extracted 1,600 Images from the TRACE Mission Image repository
Benchmark Dataset: Class balancing
We selected 8 different phenomena (or classes) and retrieved 200 random images for each phenomenon
For phenomena with 200+ reported events, we selected one random image from each event
For phenomena with fewer than 200 reported events, we selected multiple images from the same event, taken at different times
We selected images in the 171 and 1600 wavelengths due to their similarity (all from the TRACE mission)
We re-sized all images to 1024x1024 pixels
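The selection rules above can be sketched as follows. Here `events` is a hypothetical mapping from each HEK event ID to its available image IDs; the names and structure are illustrative, not from the actual retrieval pipeline:

```python
import random

def select_images(events, target=200, seed=0):
    """Pick `target` images for one phenomenon class.

    `events` maps event id -> list of candidate image ids.
    With at least `target` reported events, take one random image per
    event; otherwise draw additional images (taken at different times)
    from the same events until the quota is met.
    """
    rng = random.Random(seed)
    if len(events) >= target:
        # One random image from each of `target` randomly chosen events.
        return [rng.choice(events[eid])
                for eid in rng.sample(sorted(events), target)]
    # Too few events: sample from the pooled images of all events.
    pool = [img for imgs in events.values() for img in imgs]
    return rng.sample(pool, min(target, len(pool)))
```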
Benchmark Dataset: Problems
The HEK reported 14 different types of events (at the time of benchmark generation)
– For several phenomena there are few or no reported events
– Different phenomena are observed in different wavelengths
– Image resolutions differ within the same phenomenon
– Several phenomena are missing (Sigmoids, Polarity Inversion Line Mapping, Bright Points); so far we have received a response from Manolis K. Georgoulis regarding images containing sigmoids
Benchmark Dataset: Future Additions
Add the missing 6 classes of events if we can find suitable images
Add more images to our benchmark dataset
Create different ‘benchmarks’ for other wavelengths, since images between some wavelengths are very different
Benchmark Dataset
Event Name           # of Images  Wavelength  Resolution (count)             HEK reported events
Active Region        200          1600        768x768 (200)                  200+
Coronal Jet          200          171         1024x1024 (155), 768x768 (45)  25
Emerging Flux        200          1600        768x768 (200)                  12
Filament             200          171         1024x1024 (127), 768x768 (73)  45
Filament Activation  200          171         1024x1024 (200)                27
Filament Eruption    200          171         1024x1024 (49), 768x768 (151)  53
Flare                200          171         1024x1024 (64), 768x768 (136)  200+
Oscillation          200          171         1024x1024 (50), 768x768 (150)  3
Image Processing
Our goals in selecting image processing techniques and parameters are:
– Fast extraction of parameters from images
– Clear discrimination between different phenomena (not an easy task)
Image Processing: Parameters to Extract
So far we have extracted the following textural image parameters:
Entropy
Mean
Standard Deviation
3rd Moment (skewness)
4th Moment (kurtosis)
Relative Smoothness
Fractal Dimension
Tamura Contrast
Tamura Directionality
Histogram R
Histogram J
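Most of the first-order measures in this list can be computed directly from a block's pixel values and intensity histogram. The sketch below uses the standard textbook formulas with numpy; it is an illustration, not the project's actual extraction code, and it omits Fractal Dimension and the Tamura and Histogram parameters, which require more involved algorithms. Uniformity is included because it appears later in the correlation analysis:

```python
import numpy as np

def texture_params(block, bins=256):
    """Standard first-order texture measures for one image block."""
    # Normalized intensity histogram of the block.
    p = np.histogram(block, bins=bins, range=(0, bins))[0].astype(float)
    p /= p.sum()
    nz = p[p > 0]  # drop empty bins so log2 is defined
    mean = float(block.mean())
    std = float(block.std())
    return {
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "mean": mean,
        "std": std,
        # 3rd and 4th standardized moments (0 for a constant block).
        "skewness": float(((block - mean) ** 3).mean() / std**3) if std else 0.0,
        "kurtosis": float(((block - mean) ** 4).mean() / std**4) if std else 0.0,
        "relative_smoothness": 1.0 - 1.0 / (1.0 + std**2),
        "uniformity": float((p**2).sum()),
    }
```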
Image Processing: Segmentation
Segmenting breaks each image into smaller pieces, which lets us analyze parts of the image and decide which parts are interesting.
We used grid segmentation to split the images into 128 by 128 pixel blocks and extracted all our parameters from each block
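A 1024x1024 image with 128-pixel blocks yields an 8 x 8 grid of 64 tiles. A minimal numpy sketch of this grid segmentation (illustrative, not the project's actual code):

```python
import numpy as np

def grid_segments(image, block=128):
    """Split an image into non-overlapping block x block tiles,
    scanning row by row, left to right."""
    h, w = image.shape
    return [image[r:r + block, c:c + block]
            for r in range(0, h, block)
            for c in range(0, w, block)]
```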
Image Processing: Segmentation Example
Image Processing: Extraction Time
[Chart: total extraction time in seconds (0–450 s) for all 1,600 images at 4 x 4, 8 x 8, and 16 x 16 grid sizes, broken down by parameter: Tamura Directionality, Tamura Contrast, Histogram R, Histogram J, Kurtosis, Fractal Dimension, Entropy, Standard Deviation, Skewness, RS (Relative Smoothness), Uniformity, Mean]
Image Processing: Future Parameters to Add
Feature                                              Type
Asymmetry index                                      Asymmetry
Lengthening index                                    Asymmetry
Extent                                               Asymmetry
Middle divergence                                    Asymmetry
Compactness (irregularity) index, circularity, CIRC  Border irregularity
Regularity of contour                                Border irregularity
Edge abruptness, sharpness                           Border irregularity
Pigmentation transition                              Border irregularity
Greatest diameter                                    Others
Thinness ratio                                       Others
Lesion area and perimeter                            Others
Minimum diameter                                     Others
Transition area, transition region imbalance         Others
Background region imbalance                          Others
Peripheral dark.                                     Others
Roundness                                            Others
Fullness ratio                                       Others
Unsupervised Attribute Evaluation - Preliminary Results -
Why Correlation Analysis?
If two parameters are highly correlated, we can keep just one of them when tuning our classifiers.
Removing correlated parameters shortens processing time (especially important since we plan to run them in SDO's pipeline) and reduces future storage (the less storage used, the faster image retrieval will perform)
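A minimal numpy sketch of this correlation screening; the threshold and parameter names are illustrative, not values from the study:

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Flag parameter pairs whose absolute Pearson correlation
    exceeds `threshold`; one member of each pair is a candidate
    for removal. X is (n_samples, n_params)."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```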
Unsupervised Attribute Evaluation: Correlation Maps Within Same Class
We can certainly reduce dimensionality in two cases: 3rd Moment and 4th Moment, which are strongly correlated in the majority of our correlation maps, and likewise Uniformity and RS. Other parameters that appeared strongly correlated in our experiments (e.g., Tamura Contrast and Standard Deviation) need further analysis, since their correlation is inconsistent across phenomenon types.
We also identified a few correlations that occur only within certain types of phenomena (e.g., Filament, Filament Activation, and Filament Eruption). In the future this could let us discern these phenomena from the others and focus our parameter analysis on distinguishing among that reduced set of classes.
Unsupervised Attribute Evaluation: Multi-Dimensional Scaling for the previously shown correlation maps
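A minimal sketch of how such an MDS map of the parameters can be produced, assuming scikit-learn and using 1 − |correlation| as the dissimilarity; the presentation does not specify its MDS setup, so this is only one plausible construction:

```python
import numpy as np
from sklearn.manifold import MDS

def mds_from_correlation(corr, seed=0):
    """Project parameters to 2-D so that strongly correlated
    parameters land close together; distance = 1 - |correlation|."""
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)  # a parameter is at distance 0 from itself
    mds = MDS(n_components=2, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(dist)
```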
Classifiers
Based on our correlation analysis, we selected 3 different scenarios for our classification experiments (based on the Active Region class):
1) The original 10 image parameters
2) Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
3) Removing 3 parameters that are correlated with other parameters: Standard Deviation, Uniformity, and Tamura Contrast
Classifiers
We ran all 8 classes plus Empty Sun as a ninth class
We used the NaiveBayes classifier with 10-fold cross-validation
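This setup (a NaiveBayes classifier under 10-fold cross-validation) can be approximated in scikit-learn with GaussianNB; the synthetic 9-class, 10-feature dataset below is only a stand-in for the real image-parameter data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data: 9 classes (8 phenomena + Empty Sun), 10 parameters.
X, y = make_classification(n_samples=900, n_features=10,
                           n_informative=8, n_classes=9,
                           n_clusters_per_class=1, random_state=0)

# 10-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
```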
Classifiers: Evaluation Measures
There are many different measures used to evaluate the accuracy of classifiers. Many popular ones use the confusion matrix as a source of information about the classifier.
Commonly used measures are Precision, Recall, and F-measure, which can be defined as (using the counts TP, FP, and FN from the confusion matrix):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2·TP / (2·TP + FN + FP)
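The three definitions translate directly into code:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equivalent to the harmonic mean of precision and recall.
    f_measure = 2 * tp / (2 * tp + fn + fp)
    return precision, recall, f_measure
```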
Classifiers: Evaluation Measures
Other commonly used measures are ROC curves (Receiver Operating Characteristic), which characterize the trade-off between true-positive hits and false alarms.
The Area Under the ROC Curve (ROC area) is a measure of the accuracy of the model:
– a predictor with perfect accuracy has an area of 1.0
– the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
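The ROC area can be computed directly from its probabilistic interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small sketch (for illustration; evaluation tools compute this from the ranked scores more efficiently):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly
    chosen positive is scored above a randomly chosen negative
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```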
Preliminary Results - Classifiers: Naïve Bayes
Naïve Bayes is a linear classifier
A linear classifier groups items with similar feature values by making its classification decision based on the value of a linear combination of the features.
Preliminary Results - Classifiers: Naïve Bayes -
Using all 10 features:
Correctly Classified Instances    4421    31.6509 %
Incorrectly Classified Instances  9547    68.3491 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.838    0.194    0.351      0.838   0.494      0.898     ActiveRegion
0.152    0.033    0.368      0.152   0.215      0.775     CoronalJet
0.773    0.112    0.463      0.773   0.579      0.929     EmergingFlux
0.077    0.056    0.148      0.077   0.102      0.686     Empty
0.067    0.007    0.533      0.067   0.119      0.692     Filament
0.357    0.073    0.38       0.357   0.368      0.808     FilamentActivation
0.463    0.215    0.212      0.463   0.291      0.747     FilamentExplosion
0.002    0.01     0.024      0.002   0.004      0.62      Flare
0.12     0.07     0.177      0.12    0.143      0.692     Oscillation
Preliminary Results - Classifiers: Naïve Bayes -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    3993    28.5868 %
Incorrectly Classified Instances  9975    71.4132 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.815    0.282    0.265      0.815   0.4        0.846     ActiveRegion
0.073    0.022    0.293      0.073   0.117      0.674     CoronalJet
0.8      0.142    0.413      0.8     0.545      0.924     EmergingFlux
0.001    0.007    0.011      0.001   0.001      0.577     Empty
0.001    0        0.143      0.001   0.001      0.605     Filament
0.377    0.085    0.357      0.377   0.367      0.778     FilamentActivation
0.481    0.233    0.205      0.481   0.287      0.725     FilamentExplosion
0.003    0.01     0.039      0.003   0.006      0.557     Flare
0.022    0.021    0.116      0.022   0.037      0.664     Oscillation
Conclusion: By removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 31% to 28%).
Preliminary Results - Classifiers: Naïve Bayes -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    4641    33.2259 %
Incorrectly Classified Instances  9327    66.7741 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.808    0.156    0.393      0.808   0.529      0.902     ActiveRegion
0.22     0.056    0.33       0.22    0.264      0.776     CoronalJet
0.811    0.137    0.426      0.811   0.558      0.93      EmergingFlux
0.073    0.064    0.126      0.073   0.093      0.671     Empty
0.067    0.01     0.452      0.067   0.117      0.692     Filament
0.391    0.076    0.392      0.391   0.392      0.818     FilamentActivation
0.512    0.218    0.227      0.512   0.314      0.757     FilamentExplosion
0.059    0.019    0.282      0.059   0.098      0.643     Flare
0.048    0.016    0.276      0.048   0.081      0.696     Oscillation
Conclusion: Here, by removing 3 of the 10 parameters that were highly correlated with other parameters, we reduce storage and processing time while maintaining comparable accuracy (33% vs. 31% originally).
Preliminary Results - Classifiers: C4.5
Based on the dimensionality and distribution of our values from our Image Parameters, we decided to investigate the results of a Decision Tree classifier, as the next step after a linear classifier.
A Decision Tree classifier has the goal of creating a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.
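The experiments used C4.5; scikit-learn's DecisionTreeClassifier implements the closely related CART algorithm and, with the entropy criterion, serves as an analogous sketch. The iris dataset below is only a stand-in for the image-parameter data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Entropy-based splits, as in C4.5; evaluated with 10-fold CV.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_val_score(tree, X, y, cv=10)
```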
Using all 10 features:
Correctly Classified Instances    9163    65.5999 %
Incorrectly Classified Instances  4805    34.4001 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.757    0.044    0.681      0.757   0.717      0.902     ActiveRegion
0.636    0.055    0.589      0.636   0.612      0.831     CoronalJet
0.799    0.023    0.812      0.799   0.805      0.919     EmergingFlux
0.668    0.032    0.725      0.668   0.696      0.908     Empty
0.624    0.048    0.62       0.624   0.622      0.816     Filament
0.765    0.033    0.746      0.765   0.755      0.902     FilamentActivation
0.484    0.056    0.518      0.484   0.5        0.785     FilamentExplosion
0.534    0.051    0.567      0.534   0.55       0.796     Flare
0.637    0.045    0.641      0.637   0.639      0.845     Oscillation
Preliminary Results - Classifiers: C4.5 -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    8278    59.264 %
Incorrectly Classified Instances  5690    40.736 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.653    0.067    0.549      0.653   0.596      0.841     ActiveRegion
0.587    0.061    0.546      0.587   0.566      0.821     CoronalJet
0.778    0.031    0.759      0.778   0.769      0.908     EmergingFlux
0.573    0.041    0.634      0.573   0.602      0.861     Empty
0.564    0.055    0.56       0.564   0.562      0.791     Filament
0.772    0.035    0.731      0.772   0.751      0.908     FilamentActivation
0.414    0.063    0.452      0.414   0.432      0.747     FilamentExplosion
0.424    0.058    0.478      0.424   0.449      0.732     Flare
0.568    0.046    0.605      0.568   0.586      0.83      Oscillation
Conclusion: Again, by removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 65% to 59%).
Preliminary Results - Classifiers: C4.5 -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    8876    63.5452 %
Incorrectly Classified Instances  5092    36.4548 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.716    0.04     0.692      0.716   0.704      0.898     ActiveRegion
0.586    0.056    0.567      0.586   0.576      0.822     CoronalJet
0.791    0.029    0.773      0.791   0.782      0.914     EmergingFlux
0.678    0.039    0.687      0.678   0.682      0.905     Empty
0.61     0.053    0.589      0.61    0.599      0.811     Filament
0.767    0.034    0.737      0.767   0.752      0.9       FilamentActivation
0.448    0.057    0.498      0.448   0.472      0.776     FilamentExplosion
0.506    0.055    0.537      0.506   0.521      0.778     Flare
0.615    0.048    0.616      0.615   0.616      0.838     Oscillation
Conclusion: Once again, by removing 3 parameters that were highly correlated with other parameters, we maintained comparable accuracy (63% vs. 65% originally, a minimal decrease).
Preliminary Results - Classifiers: Adaboost C4.5
Looking for better results? Boosting may help.
AdaBoost (short for Adaptive Boosting) is a machine learning algorithm that is adaptive in the sense that each subsequent classifier is tweaked in favor of the instances misclassified by the previous classifiers
We decided to use this boosting algorithm to improve our classification results
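A hedged sketch of the same idea with scikit-learn's AdaBoostClassifier, which by default boosts shallow decision stumps rather than full C4.5 trees; the synthetic data is a stand-in for the image parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Stand-in multi-class data.
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=6, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Each round reweights the training set toward previously
# misclassified instances before fitting the next weak learner.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(boosted, X, y, cv=10)
```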
Using all 10 features:
Correctly Classified Instances    10114   72.4084 %
Incorrectly Classified Instances  3854    27.5916 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.781    0.032    0.755      0.781   0.768      0.964     ActiveRegion
0.715    0.042    0.681      0.715   0.697      0.943     CoronalJet
0.838    0.017    0.861      0.838   0.849      0.968     EmergingFlux
0.756    0.031    0.752      0.756   0.754      0.947     Empty
0.698    0.037    0.704      0.698   0.701      0.942     Filament
0.847    0.026    0.801      0.847   0.823      0.971     FilamentActivation
0.546    0.044    0.607      0.546   0.575      0.899     FilamentExplosion
0.617    0.04     0.656      0.617   0.636      0.911     Flare
0.719    0.041    0.685      0.719   0.701      0.945     Oscillation
Preliminary Results - Classifiers: Adaboost C4.5 -
Removing 3 uncorrelated parameters: Fractal Dimension, Tamura Directionality, and Tamura Contrast
Correctly Classified Instances    8920    63.8603 %
Incorrectly Classified Instances  5048    36.1397 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.66     0.058    0.589      0.66    0.622      0.923     ActiveRegion
0.646    0.049    0.622      0.646   0.634      0.909     CoronalJet
0.811    0.025    0.802      0.811   0.807      0.952     EmergingFlux
0.653    0.043    0.656      0.653   0.654      0.912     Empty
0.591    0.043    0.634      0.591   0.612      0.896     Filament
0.795    0.033    0.752      0.795   0.773      0.959     FilamentActivation
0.461    0.059    0.495      0.461   0.477      0.843     FilamentExplosion
0.486    0.053    0.533      0.486   0.509      0.844     Flare
0.644    0.045    0.644      0.644   0.644      0.913     Oscillation
Conclusion: And again, by removing parameters “randomly” we lower our chances of recognizing solar phenomena (accuracy dropped from 72% to 63%).
Preliminary Results - Classifiers: Adaboost C4.5 -
Removing 3 correlated parameters: Standard Deviation, Uniformity and Tamura Contrast
Correctly Classified Instances    9706    69.4874 %
Incorrectly Classified Instances  4262    30.5126 %

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.776    0.033    0.747      0.776   0.761      0.956     ActiveRegion
0.667    0.046    0.644      0.667   0.655      0.927     CoronalJet
0.826    0.018    0.851      0.826   0.838      0.965     EmergingFlux
0.737    0.031    0.748      0.737   0.742      0.94      Empty
0.663    0.04     0.672      0.663   0.668      0.925     Filament
0.809    0.029    0.776      0.809   0.792      0.962     FilamentActivation
0.51     0.054    0.544      0.51    0.526      0.874     FilamentExplosion
0.59     0.049    0.6        0.59    0.595      0.899     Flare
0.676    0.043    0.664      0.676   0.67       0.931     Oscillation
Conclusion: Finally, by removing 3 of the 10 parameters that were highly correlated with other parameters, we reduce storage and processing time while maintaining comparable accuracy (69% vs. 72% originally; a 3% decrease in accuracy against a 30% reduction in the number of parameters seems like a good deal).
Preliminary Conclusions
Based on these preliminary classification results, we can drop the image parameters Standard Deviation, Uniformity, and Tamura Contrast to reduce storage space and computational costs, since we achieve classification percentages similar to those obtained with the complete set of parameters.