
A Journey through Some of the Research Projects at the Knowledge Discovery & Web Mining Lab at the Univ. of Louisville

Olfa Nasraoui
Dept. of Computer Engineering & Computer Science
Speed School of Engineering, University of Louisville

Contact e-mail: [email protected]

This work is supported by NSF CAREER Award IIS-0431128, NASA Grant No. AISR-03-0077-0139 issued through the Office of Space Sciences, the Kentucky Science & Engineering Foundation, and a grant from NSTC via the US Navy.

Outline of talk
- Data Mining Background
- Mining Footprints Left Behind by Surfers on the Web
- Web Usage Mining: WebKDD process, Profiling & Personalization
- Mining Illegal Contraband Exchanges on Peer-to-Peer Networks
- Mining Coronal Loops Created from Hot Plasma Eruptions on the Surface of the Sun

Data Mining Background
- Too much data!
- There is often information hidden in the data that is not always evident.
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.

The Data Gap
- Total new disk capacity (TB) since 1995 vs. the number of analysts.
- From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications.

Data Mining: Many Definitions
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Typical DM tasks: Clustering, classification, association rule mining

Applications: wherever there is data: business, the World Wide Web (content, structure, usage), biology, medicine, astronomy, social networks, e-learning, images, etc.

Mining Footprints Left Behind by Surfers on the Web

Introduction
- Information overload: too much information to sift/browse through in order to find the desired information.
- Most information on the Web is actually irrelevant to a particular user.
- This is what motivated interest in techniques for Web personalization.
- As they surf a website, users leave a wealth of historic data about what pages they have viewed, choices they have made, etc.
- Web Usage Mining: a branch of Web Mining (itself a branch of data mining) that aims to discover interesting patterns from Web usage data (typically Web access log data / clickstreams).

Classical Knowledge Discovery Process for Web Usage Mining
- Source of data: Web clickstreams recorded in Web log files: date, time, IP address/cookie, URL accessed, etc.
- Goal: extract interesting user profiles by categorizing user sessions into groups or clusters:
  - Profile 1: URLs a, b, c (sessions in profile 1)
  - Profile 2: URLs x, y, z, w (sessions in profile 2)
  - etc.
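To make the preprocessing step concrete, here is a minimal sketch of turning log clickstreams into sessions. The record layout, the use of IP address as a crude user identifier, and the 30-minute inactivity timeout are illustrative assumptions, not the lab's actual preprocessing code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(log_records):
    """Group (user_id, timestamp, url) click records into sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same user exceeds SESSION_TIMEOUT.
    """
    by_user = defaultdict(list)
    for user_id, ts, url in log_records:
        by_user[user_id].append((ts, url))

    sessions = []
    for user_id, clicks in by_user.items():
        clicks.sort()                       # order by timestamp
        current, last_ts = [], None
        for ts, url in clicks:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append((user_id, current))
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append((user_id, current))
    return sessions

# Toy usage (IP address stands in for the user id):
records = [
    ("10.0.0.1", datetime(2005, 1, 3, 9, 0), "/products.html"),
    ("10.0.0.1", datetime(2005, 1, 3, 9, 2), "/coatings.html"),
    ("10.0.0.1", datetime(2005, 1, 3, 11, 0), "/contact.html"),  # new session
]
print(sessionize(records))
```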

Classical Knowledge Discovery Process for Web Usage Mining: the complete KDD process (diagram).

Web Personalization
- Web personalization aims to adapt the website according to the user's activity or interests.
- Intelligent Web personalization often relies on Web Usage Mining (for user modeling).
- Recommender systems: recommend items of interest to users depending on their interests.
  - Content-based filtering: recommend items similar to the items liked by the current user; no notion of a community of users (specializes to only one user).
  - Collaborative filtering: recommend items liked by similar users; combines the history of a community of users, either explicit (ratings) or implicit (clickstreams).
  - Hybrids: combine the above (and others).

Focus of our research: Different Steps of our Web Personalization System
- STEP 1, OFFLINE PROFILE DISCOVERY: server logs and site files -> preprocessing -> user sessions -> data mining (transaction clustering, association rule discovery) -> pattern discovery -> post-processing / derivation of user profiles.
- STEP 2, ACTIVE RECOMMENDATION: active session + discovered user profiles / user model -> recommendation engine -> recommendations.

Challenges & Questions in Web Usage Mining (the following slides revisit the two-step architecture above, centered on the user profiles / user model).

Dealing with Ambiguity: Semantics?
- Implicit taxonomy? (Nasraoui, Krishnapuram, Joshi, 1999): the website hierarchy (can help disambiguation, but limited).
- Explicit taxonomy? (Nasraoui, Soliman, Badia, 2005): from the DB associated with dynamic URLs; a content taxonomy or ontology (can help disambiguation, powerful).
- Concept hierarchy generalization / URL compression / concept abstraction (Saka & Nasraoui, 2006): how does abstraction affect the quality of user models?

User Profile Post-processing Criteria? (Saka & Nasraoui, 2006)
- Aggregated profiles (frequency average)? Robust profiles (discount noisy data)? How do they really perform?
- How to validate? (Nasraoui & Goswami, SDM 2006)

Evolution (Nasraoui, Cerwinske, Rojas, Gonzalez, CIKM 2006)
- Detecting & characterizing profile evolution & change?

In the case of massive evolving data streams:
- Need stream data mining (Nasraoui et al., ICDM 2003, WebKDD 2003).
- Need stream-based recommender systems? (Nasraoui et al., CIKM 2006)
- How do stream-based recommender systems perform under evolution?
- How to validate the above? (Nasraoui et al., CIKM 2006)

Clustering: HUNC Methodology
Hierarchical Unsupervised Niche Clustering (HUNC) algorithm: a robust genetic clustering approach.

Hierarchical: clusters the data recursively and discovers profiles at increasing resolutions to allow finding even relatively small profiles/user segments.

Unsupervised: determines the number of clusters automatically.

Niching: maintains a diverse population in the GA, with members distributed among niches corresponding to multiple solutions, so that smaller profiles can survive alongside bigger profiles.

Genetic optimization: evolves a population of candidate solutions through generations of competition and reproduction until convergence to one solution.

Hierarchical Unsupervised Niche Clustering Algorithm (H-UNC): Evolutionary Algorithm

Unsupervised Niche Clustering (UNC): Niching Strategy
- Hill climbing is performed in parallel with the evolution to estimate the niche sizes accurately and automatically.
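To illustrate the niching idea, here is a minimal, self-contained sketch of an evolutionary niche clustering loop. It is not the published UNC/H-UNC algorithm: the density fitness, Gaussian mutation, and crowding-style replacement below are illustrative assumptions standing in for the actual operators.

```python
import numpy as np

rng = np.random.default_rng(0)

def density_fitness(center, data, scale):
    """Robust density of the data around a candidate center (assumed fitness)."""
    d2 = np.sum((data - center) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2.0 * scale ** 2)))

def niche_clustering(data, pop_size=20, generations=50, scale=1.0, sigma_mut=0.5):
    """Evolve a population of candidate cluster centers.

    Crowding-style replacement (a mutated child competes only with the most
    similar population member) keeps several niches alive at once, so small
    clusters are not wiped out by large ones.
    """
    pop = data[rng.choice(len(data), size=pop_size, replace=False)].copy()
    fit = np.array([density_fitness(c, data, scale) for c in pop])

    for _ in range(generations):
        for i in range(pop_size):
            child = pop[i] + rng.normal(0.0, sigma_mut, size=pop.shape[1])
            child_fit = density_fitness(child, data, scale)
            # Replace the *closest* existing candidate if the child is fitter.
            j = np.argmin(np.sum((pop - child) ** 2, axis=1))
            if child_fit > fit[j]:
                pop[j], fit[j] = child, child_fit
    return pop, fit

# Toy example: two well-separated 2-D blobs of very different sizes.
data = np.vstack([rng.normal(0, 0.3, (200, 2)),
                  rng.normal(5, 0.3, (30, 2))])
centers, fitness = niche_clustering(data)
print(centers[np.argsort(-fitness)][:5])   # the best candidates cover both niches
```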

Role of the Similarity Measure: Adding Semantics
- We exploit both an implicit taxonomy inferred from the website directory structure (obtained by tokenizing the URLs) and an explicit taxonomy inferred from external data (relations such as "object 1 is parent of object 2", etc.).
- Both implicit and explicit taxonomy information are seamlessly incorporated into clustering via a specialized Web session similarity measure.

Similarity Measure
- If the site structure is ignored: cosine similarity. Map the N_U URLs on the site to indices; a user session vector s(i) is a temporally compact sequence of Web accesses by one user.
- Taking the site structure into account relates distinct URLs: the URL-to-URL similarity is based on the paths from the root of the site tree to each URL's node.
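A minimal sketch of this idea is below. The exact formula in the papers is not reproduced here; the shared-path-prefix URL similarity and the generalized-cosine session similarity are common choices used purely for illustration.

```python
import math

def url_similarity(u1, u2):
    """Syntactic URL similarity from the shared prefix of path tokens.

    Assumed form: |shared path prefix| / max(1, max path length - 1),
    capped at 1. An illustrative stand-in for the paper's measure.
    """
    p1 = [t for t in u1.strip("/").split("/") if t]
    p2 = [t for t in u2.strip("/").split("/") if t]
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return min(1.0, shared / max(1, max(len(p1), len(p2)) - 1))

def session_similarity(s1, s2):
    """Generalized (soft) cosine between two sessions given as URL lists:
    a plain cosine only rewards identical URLs, whereas here URLs that share
    part of their path also contribute."""
    num = sum(url_similarity(a, b) for a in s1 for b in s2)
    n1 = sum(url_similarity(a, b) for a in s1 for b in s1)
    n2 = sum(url_similarity(a, b) for a in s2 for b in s2)
    return num / math.sqrt(n1 * n2)

# Example: the sessions share one URL and one related URL.
s_a = ["/products/coatings/epoxy.html", "/products/coatings/polyamide.html"]
s_b = ["/products/coatings/epoxy.html", "/contact.html"]
print(session_similarity(s_a, s_b))
```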

Multi-faceted Web user profiles (diagram): Web logs and the server content DB feed pre-processing, data mining, and post-processing, yielding profiles that answer: which Web pages were viewed? what was searched for? from which companies? about which companies?

Example: Web Usage Mining of January 2005 Log Data
- Total number of users: 3333; 26 profiles discovered.
- Some user organizations: Highway Customers/Austria Facility, Naval Education And Training Program Management/United States, Magna International Inc/Canada, Norfolk Naval Shipyard (NNSY).
- Example profile URLs:
  - /FRAMES.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX/TOP_FRAME.ASPX/LEFT1.HTM/MAINPAGE_FRAMESET.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX/MENU.ASPX/MENU_FILE=.JS/UNIVERSAL.ASPX/ID=Company Connection/Manufacturers / Suppliers/Soda Blasting
  - /UNIVERSAL.ASPX/ID= Surface Treatment/Surface Preparation/Abrasive Blasting

Web Usage Mining of January 2005 (cont.)
- Who are these users? Total number of users: 3333.
- What web pages did these users visit?
- What did they search for before visiting our website?
- What page did they visit before starting their session?

Example search queries: shell blasting, Gavlon Industries, obrien paints, Induron Coatings, epoxy polyamide.
Example referring pages:
- http://www.eniro.se/query?q=www.nns.com&what=se&hpp=&ax=
- http://www.mamma.com/Mamma?qtype=0&query=storage+tanks+in+water+treatment+plant%2Bcorrosion
- http://www.comcast.net/qry/websearch?cmd=qry&safe=on&query=mesa+cathodic+protection&searchChoice=google
- http://msxml.infospace.com/_1_2O1PUPH04B2YNZT__whenu.main/search/web/carboline.com

From Profiles to Personalization: Two-Step Recommender Systems Based on a Committee of Profile-Specific URL-Predictor Neural Networks
- Step 1: match the current session to the closest profile discovered by HUNC.
- Step 2: choose that profile's specialized URL-predictor network (NN #1, NN #2, ..., NN #n).


- The recommendations are the output of the chosen network.
- Training: each neural net is trained to complete "missing puzzles", i.e. incomplete sessions drawn from its own profile/cluster (striped = incomplete input session fed to the neural network, ? = the output URL predicted to complete the puzzle).

Precision: comparison of the 2-step, 1-step, and K-NN approaches (2-step is better for longer sessions).

Coverage: comparison of the 2-step, 1-step, and K-NN approaches (2-step is better for longer sessions). A sketch of the two-step scheme follows.
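The sketch below illustrates the two-step idea using scikit-learn MLPs. The binary URL encoding, cosine profile matching, network sizes, and toy training data are all assumptions made for illustration, not the setup used in the papers.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy URL vocabulary; sessions are binary indicator vectors over it.
URLS = ["/a", "/b", "/c", "/x", "/y", "/z"]

def encode(session):
    v = np.zeros(len(URLS))
    for u in session:
        v[URLS.index(u)] = 1.0
    return v

# Assumed output of the offline profile discovery step (e.g. HUNC):
# each profile lists its characteristic URLs.
profile_urls = [["/a", "/b", "/c"], ["/x", "/y", "/z"]]
profiles = np.array([encode(p) for p in profile_urls])

# Step 2 training: one URL-predictor network per profile. Each net sees
# incomplete sessions from its own profile (one URL held out) and learns
# to predict the held-out URL that "completes the puzzle".
nets = []
for urls in profile_urls:
    X, y = [], []
    for i, held_out in enumerate(urls):
        X.append(encode([u for j, u in enumerate(urls) if j != i]))
        y.append(URLS.index(held_out))
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=3000, random_state=0)
    net.fit(np.array(X), np.array(y))
    nets.append(net)

def recommend(active_session):
    s = encode(active_session)
    # Step 1: match the active session to the closest profile (cosine).
    sims = profiles @ s / (np.linalg.norm(profiles, axis=1) * (np.linalg.norm(s) + 1e-9))
    k = int(np.argmax(sims))
    # Step 2: that profile's specialized network predicts the missing URL.
    return URLS[int(nets[k].predict(s.reshape(1, -1))[0])]

print(recommend(["/a", "/b"]))   # expected to recommend "/c"
```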

Semantic Personalization for Information Retrieval on an E-Learning Platform

E-Learning HyperManyMedia Resources
http://www.collegedegree.com/library/college-life/the_ultimate_guide_to_using_open_courseware

Architecture: semantic representation (knowledge representation), algorithms (core software), and the personalization interface.

Semantic Search Engine

Experimental Evaluation
A total of 1,406 lectures (documents) represented the profiles, with the size of each profile varying from one learner to another, as follows: Learner1 (English) = 86 lectures, Learner2 (Consumer and Family Sciences) = 74 lectures, Learner3 (Communication Disorders) = 160 lectures, Learner4 (Engineering) = 210 lectures, Learner5 (Architecture and Manufacturing Sciences) = 119 lectures, Learner6 (Math) = 374 lectures, Learner7 (Social Work) = 86 lectures, Learner8 (Chemistry) = 58 lectures, Learner9 (Accounting) = 107 lectures, and Learner10 (History) = 132 lectures.

Handling Concept Drift / Evolving Data Streams

Recommender Systems in Dynamic Usage Environments

- For massive data streams, must use a stream mining framework.
- Furthermore, must be able to continuously mine evolving data streams.

TECNO-Streams: Tracking Evolving Clusters in Noisy Streams
- Inspired by the immune system: the interaction between external agents (antigens) and immune memory (B-cells).
- Artificial immune system mapping:
  - Antigens = the data stream.
  - B-cells = cluster/profile stream synopsis = an evolving memory.
  - B-cells have an age (since their creation): gradual forgetting of older B-cells.
  - B-cells compete to survive by cloning multiple copies of themselves; cloning is proportional to the B-cell's stimulation.
  - B-cell stimulation: defined as a density criterion of the data around a profile (this is what is being optimized!).
- O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez. Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in Proc. of WebKDD 2003, Washington DC, Aug. 2003, 71-81.

The Immune Network Memory
- An external antigen (RED) stimulates a binding B-cell; the B-cell (GREEN) clones copies of itself (PINK).
- Even after the external antigen disappears, B-cells co-stimulate each other, thus sustaining each other: memory! Stimulation breeds survival.

General Architecture of the Proposed Approach (flowchart)
- 1-pass adaptive immune learning over the evolving data stream; the evolving immune network (compressed into sub-networks) is the information system.
- Key ingredients: stimulation (competition & memory), age (old vs. new), outliers (based on activation).
- Main loop: initialize the immune network and MaxLimit -> trap the initial data -> present a NEW antigen (data point) -> identify the nearest sub-network -> compute soft activations in that sub-network -> update its ARBs' stimulations and influence ranges/scales -> clone and mutate ARBs -> kill lethal ARBs -> compress the immune network into K sub-networks.
- Memory constraints: if #ARBs > MaxLimit, kill the extra ARBs (based on an age/stimulation strategy), or increase the acuteness of the competition, or move the oldest patterns to auxiliary storage.

- If the antigen activates the immune network, the antigen is cloned; otherwise it is treated as an outlier and sent to secondary storage. Domain knowledge constraints and immune-network statistics & visualization complete the flowchart.
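A heavily simplified, self-contained sketch of this immune-inspired stream clustering loop is shown below (one pass, stimulation as density, age-based forgetting, cloning of stimulated B-cells, a memory cap). The specific formulas and thresholds are assumptions for illustration, not the published TECNO-Streams algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

class BCell:
    def __init__(self, center, scale=1.0):
        self.center = np.asarray(center, dtype=float)
        self.scale = scale
        self.stimulation = 1.0
        self.age = 0

def immune_stream_clustering(stream, max_cells=30, decay=0.99, activation=0.3):
    """One pass over the stream; returns the surviving B-cells (the synopsis)."""
    cells = []
    for x in stream:
        x = np.asarray(x, dtype=float)
        for c in cells:
            c.age += 1
            c.stimulation *= decay                      # gradual forgetting
        if cells:
            acts = [np.exp(-np.sum((x - c.center) ** 2) / (2 * c.scale ** 2))
                    for c in cells]
            k = int(np.argmax(acts))
            if acts[k] > activation:
                # The antigen activates the network: stimulate, adapt, clone.
                cells[k].stimulation += acts[k]
                cells[k].center += 0.1 * acts[k] * (x - cells[k].center)
                cells.append(BCell(cells[k].center + rng.normal(0, 0.05, x.shape),
                                   cells[k].scale))
            else:
                cells.append(BCell(x))                  # outlier seeds a new cell
        else:
            cells.append(BCell(x))
        if len(cells) > max_cells:                      # memory constraint
            cells.sort(key=lambda c: c.stimulation, reverse=True)
            cells = cells[:max_cells]
    return cells

# Toy evolving stream: cluster A first, then the data drifts to cluster B.
stream = np.vstack([rng.normal(0, 0.2, (300, 2)), rng.normal(4, 0.2, (300, 2))])
synopsis = immune_stream_clustering(stream)
print(sorted((round(c.stimulation, 2), c.center.round(1).tolist()) for c in synopsis)[-3:])
```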

Summarizing noisy data in 1 pass: different levels of noise.

Validation for clustering streams in 1 pass:
- Detect all clusters (miss no cluster).
- Do not discover spurious clusters.
Validation measures (averaged over 10 runs): hits, spurious clusters.

Experimental Results: Arbitrary-Shaped Clusters


Validation Methodology in Dynamic Environments
- Mild changes: F1 versus session number, 1.7K sessions.

- TECNO-Streams scores higher (under noisy, naturally occurring but unexpected fluctuations in the user access patterns).
- Memory capacity limited to 30 nodes in the TECNO-Streams synopsis vs. 30 KNN instances.

TRAC-STREAMS
- Robust weights: decrease with distance and with the time since the data arrived (adaptive robust weights with gradual forgetting of old patterns).
- Density criterion = weights * distances/scale - weights.

Objective function:

- First term: robust sum of normalized errors.
- Second term: robust soft count of inliers (good points that are not noise).
- Total effect: minimize the robust error while including as many non-noise points as possible, to maximize robustness & efficiency from a robust estimation point of view.
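The formula itself did not survive the slide extraction. A plausible reconstruction, consistent with the "weights * distances/scale - weights" criterion and the two terms described above (a sketch, not necessarily the exact published objective), is:

```latex
% For cluster j with center c_j, scale \sigma_j, distances d_{ij} = \lVert x_i - c_j \rVert,
% and robust weights w_{ij} that decay with distance and with data age:
J_j \;=\; \sum_{i} w_{ij}\,\frac{d_{ij}^{2}}{\sigma_j^{2}} \;-\; \sum_{i} w_{ij},
\qquad
w_{ij} \;=\; e^{-d_{ij}^{2}/(2\sigma_j^{2})}\, e^{-(t - t_i)/\tau}
% first sum = robust normalized error, second sum = soft inlier count;
% the factor in (t - t_i) gradually forgets old data points (\tau assumed).
```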

DetailsIncremental location (center) update:

Incremental scale update:

Incremental updates: set ∂J/∂c = 0 and ∂J/∂σ = 0.
- Centers & scales are updated with every new data point from the stream.
- Cluster parameters (c, σ) slowly forget older data and track new data.
- Chebyshev test: used to determine whether a new data record is an outlier, and whether two clusters should be merged.

Results in 1 pass over a noisy data stream:
- The number of clusters is determined automatically.
- Clusters vary in size and density.
- Noisy data is handled.
- Scales are automatically estimated in 1 pass.
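Below is a compact sketch of the incremental, forgetting-based update of one cluster's center and scale, with a Chebyshev-style outlier test. The exact update equations come from the slide formulas that are missing here, so the exponentially forgotten running sums are an assumption in their spirit rather than the published updates.

```python
import numpy as np

class StreamCluster:
    """One cluster summarized by exponentially forgotten running sums."""

    def __init__(self, x, scale0=1.0, forget=0.99):
        x = np.asarray(x, dtype=float)
        self.forget = forget
        self.sw = 1.0                 # sum of weights
        self.swx = x.copy()           # weighted sum of points
        self.swd2 = scale0 ** 2       # weighted sum of squared distances
        self.center = x.copy()
        self.var = scale0 ** 2

    def chebyshev_outlier(self, x, k=3.0):
        """Chebyshev-style test: flag x if it lies more than k scales away
        (P(|X - mu| >= k*sigma) <= 1/k^2, with no distributional assumption)."""
        d2 = np.sum((np.asarray(x, dtype=float) - self.center) ** 2)
        return d2 > (k ** 2) * self.var

    def update(self, x):
        """Incremental center/scale update; old data is gradually forgotten."""
        x = np.asarray(x, dtype=float)
        d2 = np.sum((x - self.center) ** 2)
        w = np.exp(-d2 / (2.0 * self.var))       # robust weight of the new point
        self.sw = self.forget * self.sw + w
        self.swx = self.forget * self.swx + w * x
        self.swd2 = self.forget * self.swd2 + w * d2
        self.center = self.swx / self.sw
        self.var = max(self.swd2 / self.sw, 1e-6)

# Usage: feed a slowly drifting 1-D stream; the center tracks the drift.
rng = np.random.default_rng(0)
c = StreamCluster(x=[0.0], scale0=0.5)
for t in range(500):
    point = rng.normal(0.002 * t, 0.2, size=1)
    if not c.chebyshev_outlier(point):
        c.update(point)
print(c.center, np.sqrt(c.var))
```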

Mining Illegal Contraband Exchanges on Peer to Peer Networks

P2P
- Information is exchanged in a decentralized manner.
- An attractive platform for participants in contraband information exchange, due to the sense of anonymity that prevails while performing P2P transactions.
- P2P networks vary in the level of preserving the anonymity of their users.

Information exchange in broadcast-based (unstructured) P2P retrieval (a toy simulation sketch follows the steps list below):
- A query is initiated at a starting node, then relayed from this node to all nodes in its neighborhood, and so on, until matching information is found on a peer node.
- Finally, a direct communication is established between the initiating node and the last node to transfer the content.
- As the query propagates, each node does not know whether a neighboring node is asking for information for itself or simply transmitting/broadcasting another node's query: anonymity.
- Illegal exchanges: e.g. child pornography material.

Possible Roles of Nodes

Steps in Mining P2P Networks
1. Undercover node-based probing and monitoring (next slide) to build an approximate model of network activity.
2. Flagging contraband content: keywords, hashes, other patterns.
3. Evaluation against different scenarios: recipient querying, distribution, and routing cases.
4. Using the evaluation results to fine-tune the node positioning strategy.
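To make the broadcast/flooding retrieval concrete, here is a small simulation sketch; the overlay topology, the TTL, and the notion of "matching content" are toy assumptions.

```python
from collections import deque

def flood_query(graph, start, has_content, ttl=4):
    """Breadth-first flooding of a query through an unstructured P2P overlay.

    graph       : dict node -> list of neighbor nodes (logical topology)
    start       : node that initiates the query
    has_content : set of nodes holding matching content
    Returns the responders and the set of nodes that relayed the query;
    a relay cannot tell the initiator from a forwarder, hence the anonymity.
    """
    seen = {start}
    frontier = deque([(start, ttl)])
    responders, relays = [], set()
    while frontier:
        node, hops = frontier.popleft()
        if node in has_content and node != start:
            responders.append(node)      # would now connect directly to start
            continue
        if hops == 0:
            continue
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                relays.add(node)
                frontier.append((nb, hops - 1))
    return responders, relays

# Toy overlay: X initiates, a distributor two hops away answers.
overlay = {"X": ["A", "B"], "A": ["X", "C"], "B": ["X"], "C": ["A"]}
print(flood_query(overlay, "X", has_content={"C"}))
```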

Reconstructing Illegal Exchanges
- Each undercover node captures a local part of the exchange; the collective information from all undercover nodes is used to reconstruct a bigger picture of the exchange.
- Infer the most likely contraband query initiators (or recipients, thus type C) and content distributors (thus type A) ==> suspect rankings of the nodes.
- How to perform this inference? Relational probabilistic inference, or network influence propagation mechanisms where evidence is gradually propagated in several iterations via the links from node to node, e.g. belief propagation algorithms that rely on message passing from each node to its neighbors (see the sketch below).
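The following is a minimal sketch of the influence-propagation idea, not a full belief propagation implementation: local evidence scores observed by undercover nodes are iteratively spread along overlay links, and nodes are ranked by the resulting score. The damping factor and the evidence values are assumptions.

```python
def propagate_suspicion(graph, evidence, iterations=20, damping=0.5):
    """Iteratively spread local evidence over the overlay links.

    graph    : dict node -> list of neighbors (logical topology)
    evidence : dict node -> initial suspicion observed by undercover nodes
    Each round, a node keeps its own evidence and absorbs a damped
    average of its neighbors' current scores.
    """
    score = {n: evidence.get(n, 0.0) for n in graph}
    for _ in range(iterations):
        new = {}
        for n, nbrs in graph.items():
            neighbor_avg = sum(score[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
            new[n] = evidence.get(n, 0.0) + damping * neighbor_avg
        score = new
    return sorted(score.items(), key=lambda kv: -kv[1])   # suspect ranking

# Toy overlay with evidence captured at two undercover vantage points.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
evidence = {"B": 1.0, "D": 0.4}       # e.g. flagged queries observed near B and D
print(propagate_suspicion(overlay, evidence))
```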

Challenges:
- Keeping up to date with the dynamic nature of the network (as nodes leave and new nodes join).
- Aim for a close (yet approximate) snapshot of reality in a statistical sense, particularly by using a large number of concurrent crawlers in parallel.

Undercover Node-Based Probing: Positioning Nodes
- Distributed crawling of P2P networks to collect:
  - the logical topology (i.e. the virtual connections between nodes that determine which neighbors are accessible to each node when passing a message);
  - the network topology (i.e. actual network identifiers such as the IP address of a node).
- Analysis of the correlations between the logical and network topologies: treat the network topology (e.g. top-level domain) as class information, cluster the logical topology, and then compare class versus cluster entropies using, e.g., the information gain measure.
- Use the above results to validate assumptions about the relationships between logical and network topologies.
- Use the results to refine node positioning in our proposed node-based probing and monitoring algorithm in the next step.

Mining Coronal Loops Created from Hot Plasma Eruptions on the Surface of the Sun

Mining Solar Images to Support Astrophysics Research
Olfa Nasraoui
Computer Engineering & Computer Science
University of Louisville
[email protected]

In collaboration with

Joan Schmelz
Department of Physics
University of Memphis
[email protected]

Acknowledgement: team members who worked on this project: Nurcan Durak, Sofiane Sellah, Heba Elgazzar, Carlos Rojas (Univ. of Louisville); Jonatan Gomez and Fabio Gonzalez (National Univ. of Colombia); Jennifer Roames, Kaouther Nasraoui (Univ. of Memphis).


Motivations (1): The Coronal Heating Problem
The question of why the solar corona is so hot has remained one of the most exciting astronomy puzzles for the last 60 years.

Temperature increases very steeply from 6000 degrees in the photosphere (the visible surface of the Sun) to a few million degrees in the corona (the region 500 kilometers above the photosphere).

Even though the Sun is hotter on the inside than on the outside, the outer atmosphere of the Sun (the corona) is indeed hotter than the underlying photosphere!

Measurements of the temperature distribution along the coronal loop length can be used to support or eliminate various classes of coronal temperature models.

Scientific analysis requires data observed by instruments such as EIT, TRACE, and SXT.


Motivations (2): Finding Needles in Haystacks (Manually)
The biggest obstacle to completing the coronal temperature analysis task is collecting the right data (manually).

The search for interesting images (with coronal loops) is by far the most time consuming aspect of this coronal temperature analysis.

Currently, this process is performed manually. It is therefore extremely tedious, and hinders the progress of science in this field.

The next-generation "EIT", called MAGRITE and scheduled for launch in a few years on NASA's Solar Dynamics Observatory, should be able to take as many images in about four days as were taken by EIT over 6 years!

It will no doubt need state-of-the-art techniques to sift through the massive data to support scientific discoveries.

Goals of the Project: Finding Needles in Haystacks (Automatically)
Develop an image retrieval system based on data mining to quickly sift through data sets downloaded from online solar image databases and automatically discover the rare but interesting images containing solar loops, which are essential in studies of the Coronal Heating Problem.

Publishing mined knowledge on the web in an easily exchangeable format for astronomers.

Sources of Data
- EIT: Extreme UV Imaging Telescope aboard the NASA / European Space Agency spacecraft SOHO (Solar and Heliospheric Observatory): http://umbra.nascom.nasa.gov/eit
- TRACE: NASA's Transition Region And Coronal Explorer: http://vestige/lmsal.com/TRACE.SXT
- SXT: Soft X-ray Telescope database on the Japanese spacecraft Yohkoh: http://ydac.mssl.ucl.ac.uk/ydac/sxt/sfm-cal-top.html


Samples of Data: EIT


Steps
1. Sample image acquisition and labeling: images with and without solar loops, 1020 x 1022 pixels, ~2 MB per image.
2. Image preprocessing, block extraction, and feature extraction.
3. Building & evaluating classification models:
   - At the block level (is a block a loop or no-loop block?): 10-fold cross validation (train, then test on an independent set, 10 times, and average the results).
   - At the image level (does an image contain a loop block?): use the model learned from the training data (one global model, or one model per solar cycle) and test on an independent set of images from different solar cycles.

Step 2. Image Preprocessing and Block Extraction
- Despeckling (to clean noise) and gradient transformation (to bring out the edges).
- Phase I (loops outside the solar disk): divide the area outside the solar disk into blocks with an optimal size (to maximize overlap with the marked areas over all training images).
- Use each block as one data record and extract individual data attributes from it for learning and testing.
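A small sketch of this step using scipy/numpy is shown below (median-filter despeckling, a Sobel gradient magnitude, and a simple grid of fixed-size blocks). The filter size, block size, and synthetic input frame are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def preprocess(image):
    """Despeckle with a median filter, then compute a gradient-magnitude
    image (Sobel) to bring out the loop edges."""
    despeckled = ndimage.median_filter(image, size=3).astype(float)
    gx = ndimage.sobel(despeckled, axis=0)
    gy = ndimage.sobel(despeckled, axis=1)
    return np.hypot(gx, gy)

def extract_blocks(image, block=64):
    """Cut the (preprocessed) image into non-overlapping block x block tiles;
    each tile becomes one data record for feature extraction."""
    h, w = image.shape
    return [image[r:r + block, c:c + block]
            for r in range(0, h - block + 1, block)
            for c in range(0, w - block + 1, block)]

# Toy usage on a random array standing in for an EIT frame.
rng = np.random.default_rng(0)
frame = rng.random((256, 256))
blocks = extract_blocks(preprocess(frame), block=64)
print(len(blocks), blocks[0].shape)
```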

Difficult classification problem

Loops come in different sizes, shapes, intensities, etc

- Regions without interesting loops are hardly distinguishable.
- Inconsistencies in labeling are common (subjectivity, quality of the data).

Even at the edge level it is challenging: which block is NOT a loop block?


Defective and Asymmetric nature of Loop Shapes

Features inside each block, applied to the original intensity levels: Statistical Features
- Mean
- Standard deviation
- Smoothness
- Third moment
- Uniformity
- Entropy
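A sketch of these six statistics computed from a block's gray-level histogram is below, following the standard statistical-texture definitions (smoothness as 1 - 1/(1 + variance), uniformity as the sum of squared histogram probabilities). These exact formulas are an assumption here, since the slides do not spell them out.

```python
import numpy as np

def statistical_features(block, bins=256):
    """Mean, std, smoothness, third moment, uniformity, and entropy of a block.

    Standard gray-level-histogram texture descriptors; intensities are
    assumed to be scaled to [0, 1].
    """
    values = block.ravel().astype(float)
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    levels = (np.arange(bins) + 0.5) / bins

    mean = np.sum(levels * p)
    var = np.sum(((levels - mean) ** 2) * p)
    return {
        "mean": mean,
        "std": np.sqrt(var),
        "smoothness": 1.0 - 1.0 / (1.0 + var),
        "third_moment": np.sum(((levels - mean) ** 3) * p),
        "uniformity": np.sum(p ** 2),
        "entropy": -np.sum(p[p > 0] * np.log2(p[p > 0])),
    }

# Example on a synthetic block.
rng = np.random.default_rng(0)
print(statistical_features(rng.random((64, 64))))
```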

Features inside each block, applied to the edges: Hough-Based Features
- First apply the Hough transform: image space -> Hough space (H.S.); each pixel corresponds to parameter combinations for a given shape, and all pixels vote for several parameter combinations.
- Extract peaks from the H.S.

- Then construct features based on the H.S.
- Peak detection is very challenging: many false peaks (noise), and bin splitting (peaks get split across bins).
- The biggest problem is the size of the Hough accumulator array: every pixel votes for all possible curves that go through that pixel, so there is a combinatorial explosion as we add more parameters.
- Solution: we feed the Hough space votes into a stream clustering algorithm to detect the peaks.

TRAC-Stream clustering: eliminates the need to store the Hough accumulator array by processing the votes in 1 pass.
Input: initial scale s0, maximum number of clusters.
Output: a running (real-time) synopsis of the clusters in the input stream.
Repeat until the end of the stream:
  - Input the next data point x.
  - For each cluster in the current synopsis:
    - Perform the Chebyshev test (a compatibility test without any assumptions on the distributions, but it requires robust scale estimates).
    - If x passes the Chebyshev test, update the cluster parameters: centroid, scale.
  - If there is no cluster yet, or x fails all Chebyshev tests, create a new cluster (c = x, s = s0).
  - Perform pairwise Chebyshev tests to merge compatible clusters: the densest cluster absorbs the merged cluster, and its centroid is updated.
  - Eliminate clusters with low density.
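A runnable sketch of this loop over 2-D Hough votes is given below. The Chebyshev threshold, the incremental centroid/scale estimates, and the merge rule are simplified assumptions, not the published TRAC-Stream updates.

```python
import numpy as np

def trac_stream_sketch(points, s0=1.0, k=3.0, min_weight=3.0):
    """One-pass clustering of streaming points (e.g. 2-D Hough votes).

    Each cluster keeps a running centroid, variance (scale), and weight.
    A point joins the first cluster whose Chebyshev-style test it passes,
    otherwise it seeds a new cluster; compatible clusters are merged and
    low-density clusters are pruned at the end.
    """
    clusters = []                                   # [centroid, var, weight]
    for x in points:
        x = np.asarray(x, dtype=float)
        for c in clusters:
            d2 = np.sum((x - c[0]) ** 2)
            if d2 <= (k ** 2) * c[1]:               # Chebyshev-style test
                c[2] += 1.0
                c[0] += (x - c[0]) / c[2]           # incremental centroid
                c[1] += (d2 - c[1]) / c[2]          # incremental scale
                break
        else:
            clusters.append([x.copy(), s0 ** 2, 1.0])
        # Merge pairs whose centroids pass each other's Chebyshev test.
        merged = []
        for c in clusters:
            for m in merged:
                if np.sum((c[0] - m[0]) ** 2) <= (k ** 2) * max(c[1], m[1]):
                    if c[2] > m[2]:                 # the denser cluster absorbs
                        m[0], m[1] = c[0], c[1]
                    m[2] += c[2]
                    break
            else:
                merged.append(c)
        clusters = merged
    return [(c[0], np.sqrt(c[1]), c[2]) for c in clusters if c[2] >= min_weight]

# Toy Hough space with two true peaks plus uniformly scattered noise votes.
rng = np.random.default_rng(0)
votes = np.vstack([rng.normal((10, 30), 0.5, (200, 2)),
                   rng.normal((40, 80), 0.5, (150, 2)),
                   rng.uniform(0, 100, (50, 2))])
rng.shuffle(votes)
print(trac_stream_sketch(votes, s0=1.0))
```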

Examples of clustering 2-D Hough space

Curvature Features: original image vs. the automatically detected curves.

Step 3. Classification
Classification algorithm (type of classifier):
- Decision Stump (decision tree induction & pruning)
- C4.5 (decision tree induction)
- AdaBoost (decision tree with boosting)
- RepTree (tree induction)
- Conjunctive Rule (rule generation)
- Decision Table (rule generation)
- PART (rule generation)
- JRip (rule generation)
- 1-NN (lazy)
- 3-NN (lazy)
- SVM, Support Vector Machines (function based)
- Multi-Layer Perceptron, neural network (function based)
- Naive Bayes, Bayesian classifier (probabilistic)
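The block-level evaluation protocol can be sketched with scikit-learn counterparts of a few of the listed learners; the feature matrix below is synthetic, and the 10-fold precision/recall protocol follows the Steps slide. This is an illustration, not the toolchain used in the original study.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the block feature matrix (rows = blocks, columns =
# statistical / Hough / curvature features) and the loop / no-loop labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

classifiers = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
}

# Block-level evaluation: 10-fold cross validation, reporting precision
# and recall for the "loop" class, averaged over the folds.
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=("precision", "recall"))
    print(f"{name:15s} precision={scores['test_precision'].mean():.2f} "
          f"recall={scores['test_recall'].mean():.2f}")
```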

EIT image survey, per month (total viewed images | loop images | # loops on limb | # loops on disk | wavelength(s) | limb loops / viewed images | limb loops / all loops):
- Aug. 96: 37 | 6 | 3 | 3 | 171 | 3/37 = 8% | 3/6 = 50%
- Mar. 00: 115 | 6 | 1 | 5 | 171 | 1/115 = 0.8% | 1/6 = 16%
- Dec. 00: 123 | 18 | 3 | 15 | 171 | 3/123 = 2% | 3/18 = 1/6 = 16%
- Feb. 04: 111 | 17 | 1 | 16 | 171 | 1/111 = 0.9% | 1/17 = 5%
- Jan. 05: 372 | 48 | 15 | 33 | 171, 195, 284, 304 | 15/372 = 4% | 15/48 = 31%
- Jun. 05: 360 | 18 | 17 | 13 | 171, 195, 284, 304 | 17/360 = 4% | 17/18 = 94%
- Jun. 96: 113 | 14 | 2 | 12 | 171 | 2/113 = 1% | 2/14 = 14%
- Dec. 96: 167 | 162 | 23 | 139 | 171 | 23/167 = 13% | 23/162 = 14%
- Mar. 97: 154 | 31 | 3 | 28 | 171 | 3/154 = 2% | 3/31 = 9%

Block-based results (precision / recall per feature set and classifier), using 150 solar images from 1996, 1997, 2000, 2001, 2004, 2005 (403 loop blocks, 7950 no-loop blocks):
- AdaBoost: Statistical 0.39/0.256, Hough-based 0.464/0.4, Spatial 0.458/0.409, Curvature 0.416/0.325, All features 0.643/0.596
- NB: Statistical 0.193/0.057, Hough-based 0.398/0.618, Spatial 0.437/0.501, Curvature 0.289/0.801, All features 0.404/0.573
- MLP: Statistical 0.484/0.218, Hough-based 0.463/0.457, Spatial 0.463/0.139, Curvature 0.442/0.3, All features 0.6/0.638
- C4.5: Statistical 0.459/0.266, Hough-based 0.441/0.362, Spatial 0.481/0.434, Curvature 0.419/0.218, All features 0.548/0.536
- RIPPER: Statistical 0.515/0.166, Hough-based 0.487/0.325, Spatial 0.498/0.36, Curvature 0.448/0.233, All features 0.591/0.613
- K-NN (k=5): Statistical 0.431/0.201, Hough-based 0.472/0.357, Spatial 0.479/0.31, Curvature 0.377/0.208, All features 0.641/0.462

Loop Mining Tool (screenshot).

Image-Based Testing Results (precision / recall per solar cycle):
- Minimum cycle: precision 0.5, recall 0.9
- Medium cycle: precision 0.8, recall 0.7
- Maximum cycle: precision 0.8, recall 1
- All cycles: precision 0.88, recall 0.83

Confusion matrix (actual vs. predicted):
- Actual loop images: 40 predicted loop, 10 predicted no-loop, 50 total
- Actual no-loop images: 11 predicted loop, 39 predicted no-loop, 50 total
- Totals: 51 predicted loop, 49 predicted no-loop, 100 images

Conclusions
- More DATA -> data MINING; FASTER data arrival rates (massive data) -> STREAM data mining.
- Many BENEFITS: from DATA to KNOWLEDGE! Better knowledge -> better decisions -> help society, e.g. scientists, businesses, students, the casual Web surfer, cancer patients, children.
- Many RISKS: privacy, ethics, legal implications:
  - Might anything that we do harm certain people?
  - Who will use a certain data mining tool? What if certain governments use it to catch political dissidents?
  - Who will benefit from DM?
  - FAIRNESS: e.g. shouldn't businesses/websites reward Web surfers in return for the wealth of data/user profiles that their clickstreams provide?

Questions / Comments / Discussions?

(Backing data for the Data Gap chart: total new disk capacity grew from about 105 PB in 1995 to roughly 13,000 PB by 2003, while the total number of science and engineering Ph.D.s stayed roughly flat at about 22,900-27,300 per year during 1990-1999.)


(P2P diagram labels: a recipient, normal or undercover, initiates a query; the query is relayed through intermediate nodes X, Z, and the undercover node Y until a distributor is reached and "found" replies are returned.)