Ernestina Menasalvas Ruiz, Pedro Sousa
DATA MINING
GOAL
• Extract knowledge from aviation data sources to obtain patterns that help detect incidents
• Learn behaviour models
What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
KDD process
CRISP-DM (www.crispdm.org)
• Business Understanding
• Data Understanding
• Data Preparation
• Model
• Evaluate
[Diagram labels: ARSS, fleet]
Challenges
• Data integration
  • Aircraft information
  • Context: sensors, space weather, location, weather
  • Operations: pre-flight, departure, climb, en route, arrival, taxiing, post-flight
  • Aviation safety reports
• Dynamic and complex data:
  – Theoretical and practical aspects of the algorithms have to be analyzed to discover the most appropriate techniques:
    • trend analysis, association of events, data stream methods, context integration, resource awareness
GOAL (cont)
• Apply algorithms to mine the various data sources for information
  – to identify patterns:
    • atypical flights
    • anomalous cockpit procedures
    • groups of safety reports
• BUT:
  – KDD is a process
    • Static vs dynamic
KDD process
Approx. 80% effort
Data Exploration and transformation
• Exploration of the data to better understand its characteristics
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans’ abilities to recognize patterns
  – Integrating the semantics of the data
  – Clustering and anomaly detection will be used as exploratory techniques
• Transform data prior to mining so as to be able to extract the useful patterns
Data Mining Tasks
• Prediction (supervised learning)
  – Use some historical information to learn a model that can help to predict unknown or future values of some variable
  – Basis for forecasting
    • Classification
    • Regression
    • Deviation detection
• Description (unsupervised)
  – Find patterns that describe the data
  – Clustering
  – Association rule discovery
  – Sequential pattern discovery
Classification
• Given a collection of records in which the class is known:
  – Find a model able to describe the class given the values of the rest of the attributes
• Measurements have to be used to validate the model and determine the accuracy of prediction
  – Train and test
• Techniques
  – Induction trees
    • C4.5, ID3
    • Very efficient in terms of execution time
    • Very intuitive results
  – Neural networks
    • The result is a neural network: a black box
    • Robust
    • Not intuitive
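The train-and-test idea and the split criterion used by ID3-style induction trees can be illustrated with a minimal Python sketch. It builds only a one-level tree (a "stump") by choosing the attribute with the highest information gain, then measures accuracy on held-out records. The attribute names (`phase`, `alarm`), the class labels and all records are hypothetical toy data, not taken from the slides, and this is not a full C4.5 or ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, labels, attribute):
    """Entropy reduction obtained by splitting the records on one attribute."""
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(record[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def train_stump(records, labels):
    """Pick the attribute with the highest gain; predict the majority class per value."""
    best = max(records[0].keys(), key=lambda a: information_gain(records, labels, a))
    votes = {}
    for record, label in zip(records, labels):
        votes.setdefault(record[best], Counter())[label] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return best, rules, default

def predict(model, record):
    attribute, rules, default = model
    return rules.get(record[attribute], default)

# Hypothetical toy records: (attributes, known class).
train = [
    ({"phase": "climb", "alarm": "yes"}, "atypical"),
    ({"phase": "climb", "alarm": "no"}, "normal"),
    ({"phase": "arrival", "alarm": "yes"}, "atypical"),
    ({"phase": "arrival", "alarm": "no"}, "normal"),
    ({"phase": "enroute", "alarm": "no"}, "normal"),
    ({"phase": "enroute", "alarm": "yes"}, "atypical"),
]
test = [
    ({"phase": "climb", "alarm": "yes"}, "atypical"),
    ({"phase": "arrival", "alarm": "no"}, "normal"),
]

records = [r for r, _ in train]
labels = [c for _, c in train]
model = train_stump(records, labels)

# Accuracy is measured on records that were not used for training.
accuracy = sum(predict(model, r) == c for r, c in test) / len(test)
print(model[0], accuracy)
```

The resulting rule set is directly readable (one class per attribute value), which is the "intuitive results" property the slide attributes to induction trees.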
Clustering
• Given a set of (unclassified) records, group them in such a way that:
  – records in one cluster are more similar to one another
  – records in separate clusters are less similar to one another
• Similarity measures have to be defined:
  – Special attention to understanding the distance
• Approaches
  – Partitioning algorithms: they first build different partitions and then these partitions are evaluated:
    • K-means
  – Hierarchical: they build a hierarchical decomposition
  – Density based: density functions are used
  – Kohonen networks [Kohonen ‘95]
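As an illustration of the K-means approach listed above, here is a minimal Python sketch of Lloyd's algorithm with Euclidean distance. The points and the choice of initial centroids are hypothetical; a real application would also need attribute scaling and a considered choice of distance, as the slide warns.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment

# Hypothetical 2-D records forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
print(assignment)
```

Passing the initial centroids explicitly keeps the sketch deterministic; common implementations pick them at random and rerun several times.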
Association Rule Discovery
• Given a set of records described by a set of attributes:
  – Find associations in the values of attributes
  – Once associations are discovered, rules can be obtained
  – Confidence vs support
  – Apriori algorithm

Example itemset: At1=1 and At3=1 and At4=1

At1 At2 At3 At4 At5 At6 At7
 0   1   0   1   1   0   0
 0   0   0   0   1   0   0
 1   0   1   1   0   0   0
 1   0   1   1   0   1   1
 0   0   0   0   0   0   1
 0   1   0   1   1   0   1
 1   1   1   1   1   0   0
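Support and confidence can be computed directly on the example records. The sketch below assumes the 49 values of the example are read row by row as seven records over At1..At7; the function names are illustrative, and this is a plain counting sketch rather than the Apriori algorithm itself.

```python
ATTRIBUTES = ["At1", "At2", "At3", "At4", "At5", "At6", "At7"]

# Example records, assuming a row-by-row reading of the slide's table.
ROWS = [
    [0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 0],
]

def count(itemset):
    """Number of records in which every attribute of the itemset equals 1."""
    columns = [ATTRIBUTES.index(name) for name in itemset]
    return sum(all(row[c] == 1 for c in columns) for row in ROWS)

def support(itemset):
    """Fraction of all records that contain the itemset."""
    return count(itemset) / len(ROWS)

def confidence(antecedent, consequent):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(support({"At1", "At3", "At4"}))       # 3 of the 7 records
print(confidence({"At1", "At3"}, {"At4"}))  # rule: At1=1 and At3=1 => At4=1
print(confidence({"At4"}, {"At1"}))         # rule: At4=1 => At1=1
```

A rule can have high confidence and still be of little use if its support is low, which is why the two measures are weighed against each other.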
Challenges of the algorithms
• Algorithms to find anomalies in large datasets have to be:
  – fast
  – scalable
  – accurate
• Algorithms have to be able to deal with:
  – continuous sequences, representing sensor data such as airspeed and altitude
  – discrete sequences, such as sequences of pilot switch presses
Data streams vs static data
Data streams
Challenges for algorithms:
- Processing data in a single pass.
- Generating models in an incremental way.
- Ability to detect model changes over time.
- Limited usage of memory and computing time.
- Possibility of automating the evaluation process.

A data stream:
- is potentially unbounded in size
- needs to be analyzed over time
- arrives at a very high rate
- and its underlying model evolves over time
[Aggarwal et al.] “Data Streams: Models and Algorithms”. Advances in Database Systems, Springer, 2007
[Aguilar-Ruiz, Gama] “Data Streams”. Journal of Universal Computer Science, 2005
[Barbará] “Requirements for clustering data streams”. SIGKDD’02.
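To make "single pass, incremental, bounded memory" concrete, here is a minimal Python sketch that maintains the mean and variance of a sensor stream with Welford's online algorithm. The class name and the sample values are hypothetical; the point is only that each reading is seen once and the state stays constant in size however long the stream runs.

```python
class RunningStats:
    """Mean and sample variance of a stream, updated one value at a time."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: constant time and constant memory per value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical readings arriving one by one (e.g. a normalized sensor value).
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # the reading can be discarded after this call

print(stats.mean, stats.variance)
```

This sketch does not forget old data, so on its own it cannot follow a model that evolves over time; windowing or decay would be needed for that.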
Goal
• New challenges introduced by evolving data, such as:
  – resource-aware learning
  – change detection
  – novelty detection
  – important application areas where data evolution must be taken into account
  – how learning under constraints (time, storage capacity and other resources) is affected by data evolution
  – how context can help the learning process
Change and concept drift
[Figure: four plots of the data mean over time, showing sudden drift, gradual drift, incremental drift and reoccurring contexts]
[Joao Gama 2010]
Concept drift: the underlying concept may shift unexpectedly from time to time.
• Changes appear because of:
  • Adversary actions
  • Varying personal interests
  • Changing population
  • Complex environment
Required features
• Examples have to be processed as they arrive
• Each example should be processed:
  – In small constant time
  – Using a fixed amount of main memory
  – In a single scan of the data
  – Without revisiting old records (or with reduced revisiting)
• Produce models equivalent to the one that would be obtained by a batch data-mining algorithm
• Detect and react to concept drift
[Joao Gama 2010]
Recurrent concepts
• Many learning algorithms deal with concept drift
  – Based on: time windows, ensembles, drift detection
  – FLORA, SEA, DWM, DMM, ...
• What about recurrent concepts?
  – A particular type of concept drift
  – With forgetting mechanisms, past data and models are discarded
  – However, it is common for concepts to reappear
Context and data stream
Context
• Context representation:
• Context similarity:
  – numeric:
  – nominal:
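A common way to compare contexts that mix numeric and nominal attributes, assumed here purely for illustration and not necessarily the measure used by the authors, is a range-normalized absolute difference for numeric attributes and a 0/1 mismatch for nominal ones, averaged over all attributes. The attribute names and value ranges in this Python sketch are hypothetical.

```python
def context_distance(c1, c2, ranges):
    """Average per-attribute distance between two contexts, in [0, 1].

    Attributes listed in `ranges` are treated as numeric and compared by
    range-normalized absolute difference; all others are treated as nominal
    and compared by exact match (0 if equal, 1 otherwise).
    """
    total = 0.0
    for name in c1:
        if name in ranges:
            total += min(abs(c1[name] - c2[name]) / ranges[name], 1.0)
        else:
            total += 0.0 if c1[name] == c2[name] else 1.0
    return total / len(c1)

# Hypothetical context states described by the same attributes.
ranges = {"altitude_ft": 40000}
c1 = {"altitude_ft": 30000, "phase": "enroute", "weather": "clear"}
c2 = {"altitude_ft": 20000, "phase": "enroute", "weather": "storm"}

print(context_distance(c1, c2, ranges))  # (0.25 + 0 + 1) / 3
```

A similarity can then be taken as one minus this distance, so identical contexts score 1.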
Context integration
• We want to integrate context information with previously learned models.
• freqC is the most frequent context in a sequence of context states {C1, C2, ... Cn}
• Concept history with associated context: h(Mk|Ci)
• Estimate that Mk represents the current underlying concept given the current context.
Model Storage
• Model storage for a model Mk:
  • the period k in which the model was used
  • using NB requires storing the CV
  • the frequent context freqC for period k
  • the accuracy of the model when it was in use
• Represented as the tuple:
Model Retrieval
• Model retrieval for a model Mk:
  – using a sample Sn of recent records,
  – compute the MSE for Mk
  – get the freqC for Sn
  – use the history h(Mk|freqC)
• The utility is defined based on model accuracy (the higher the better) and on how similar the model's context is to the current one (minimum distance).
• Retrieve the model with highest utility as:
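The retrieval steps above can be sketched in Python under stated assumptions. The utility used here, (1 - MSE) multiplied by (1 - context distance), is a hypothetical combination chosen only to reward high accuracy and a close context; it is not the authors' formula. `StoredModel`, the threshold models and the context tuples are likewise illustrative.

```python
from collections import Counter, namedtuple

# Hypothetical stored-model record: a name, its frequent context, and a predictor.
StoredModel = namedtuple("StoredModel", "name freq_context predict")

def most_frequent_context(contexts):
    """freqC: the most frequent context in a sequence of context states."""
    return Counter(contexts).most_common(1)[0][0]

def nominal_distance(c1, c2):
    """Fraction of context attributes that differ (contexts are tuples)."""
    return sum(a != b for a, b in zip(c1, c2)) / len(c1)

def utility(model, sample, current_context):
    # MSE of the stored model on the recent sample Sn; with 0/1 labels and
    # 0/1 predictions this equals the error rate, so 1 - MSE is the accuracy.
    mse = sum((model.predict(x) - y) ** 2 for x, y in sample) / len(sample)
    # Assumed combination: accurate models with a close context score highest.
    return (1.0 - mse) * (1.0 - nominal_distance(model.freq_context, current_context))

def retrieve(models, sample, recent_contexts):
    current = most_frequent_context(recent_contexts)
    return max(models, key=lambda m: utility(m, sample, current))

# Two hypothetical stored models with the context in which each was in use.
models = [
    StoredModel("M1", ("night", "storm"), lambda x: 1 if x > 5 else 0),
    StoredModel("M2", ("day", "clear"), lambda x: 1 if x > 2 else 0),
]
sample = [(1, 0), (3, 0), (6, 1), (8, 1)]  # recent (value, label) records
recent_contexts = [("night", "storm"), ("night", "storm"), ("day", "storm")]

best = retrieve(models, sample, recent_contexts)
print(best.name)
```

Here M1 both fits the recent sample and was learned under the currently dominant context, so it is the model retrieved.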
CALDS: learning process
• Incrementally learn the underlying concept
• When a warning is signaled:
  • Prepare a new base learner for the possible new concept
  • Anticipate the drift
• When drift is detected:
  • Store the current model
  • Reuse a previously learned model when the underlying concept is recurrent
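The warning and drift signals that drive this process could come from an error-rate monitor. The Python sketch below assumes a detector in the style of the Drift Detection Method (DDM) of Gama et al., which the slides do not confirm is the one used by CALDS: it tracks the running error rate p and its standard deviation s, remembers the minimum of p + s, and signals a warning at two and a drift at three standard deviations above that minimum. The class name, `min_samples` and the synthetic error stream are hypothetical.

```python
import math

class DriftDetector:
    """DDM-style detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0                 # running error rate
        self.p_min = float("inf")    # error rate at the best point seen so far
        self.s_min = float("inf")    # its standard deviation

    def update(self, error):
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"          # too few examples to judge
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3 * self.s_min:
            self.reset()             # start monitoring the new concept afresh
            return "drift"
        if self.p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"

# Synthetic stream: 10% errors for 100 examples, then 50% errors (a sudden drift).
errors = [1 if i % 10 == 9 else 0 for i in range(100)] + [i % 2 for i in range(100)]
detector = DriftDetector()
signals = [detector.update(e) for e in errors]

first_warning = signals.index("warning")
first_drift = signals.index("drift")
print(first_warning, first_drift)
```

In a learning loop of the kind described above, "warning" would be the point to start training a new base learner, and "drift" the point to store the current model and look for a reusable one.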
CALDS: learning process
Improvements integrating context
Overall accuracy: 72.5%; 69.6%; 62.2%
SOME PREVIOUS EXPERIENCE
Other current applications
• ESA (European Space Agency)
  – Event Reporting Tool for non-manned satellite passes (Cryosat monitoring)
Current applications
• ESA (European Space Agency) / Galileo Industries
  – Galileo: Ground Control Segment Central Monitoring & Control Facility
Some current applications
• Portuguese Navy
  – Singrar: Integrated System for Ship Repair and Resource Allocation
The process
[Diagram labels: Integrated Risk; Plans Activation / Maintenance; Drillings Training; Application Input]
Space Weather
Why – Space Weather?
• To protect systems and people that might be at risk from space weather effects, we need to understand the causes of space weather.
Space Weather Decision Support System
• SWDSS is the third project on space weather (SW) financed by the European Space Agency (ESA)
• The main objective of SWDSS is to develop software capable of storing, manipulating and reacting to adverse space weather situations in spacecraft:
  – Providing tools for analyzing the collected data;
  – Supplying reporting facilities for systems management;
  – Supplying a knowledge discovery tool for nowcast, forecast and data mining.
Data sources and providers
• Mission’s telemetry (payload and/or housekeeping) data and processed data
• Mission’s auxiliary data, e.g. orbital coordinates, apogee and perigee crossings, station coverage and hand-over, events, 3D models, metadata
• Data available from other sources, e.g. NOAA, SIDC, SWENET, National Agencies
• Data from ground-based measurements
Satellite Monitoring
Conclusion
• Huge amount of aviation data
  1. Integrate data (micro and macro level)
  2. Enrich data with semantics
  3. Map data to techniques to discover patterns (static and streams):
     1. Anomalies
     2. Predictive
     3. Sequences
     4. Context influence
• Data mining in other similar domains has obtained results
• Next step: data mining for aviation safety
Ernestina Menasalvas Ruiz, Pedro Sousa
THANKS