Ernestina Menasalvas Ruiz Pedro Sousa. GOAL Extract knowledge from aviation data sources to obtain patterns that help detection of incidents Learn behaviour

Ernestina Menasalvas Ruiz Pedro Sousa

DATA MINING

GOAL

• Extract knowledge from aviation data sources to obtain patterns that help detection of incidents

• Learn behaviour models

What is Data Mining?

• Many definitions:
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

KDD process

….

CRISP-DM (www.crispdm.org)

[Figure: CRISP-DM cycle with the phases Business Understanding, Data Understanding, Data Preparation, Model, Evaluate; additional figure labels: ARSS, fleet]

Challenges

• Data integration
• Aircraft information
• Context: sensors, space weather, location, weather
• Operations: pre-flight, departure, climb, en route, arrival, taxiing, post-flight
• Aviation safety reports

• Dynamic and complex data:
– theoretical and practical aspects of the algorithms have to be analyzed to discover the most appropriate techniques:
• trend analysis, association of events, data stream methods, context integration, resource awareness

GOAL (cont)

• Apply algorithms to mine the various data sources for information
– to identify patterns:
• atypical flights
• anomalous cockpit procedures
• groups of safety reports
• BUT:
– KDD is a process
• Static vs dynamic

KDD process

Approx. 80% effort

Data Exploration and transformation

• Exploration of the data to better understand its characteristics.
– Helping to select the right tool for preprocessing or analysis
– Making use of humans' abilities to recognize patterns
– Integrating the semantics of the data
– Clustering and anomaly detection will be used as exploratory techniques
• Transform data prior to mining so that the useful patterns can be extracted

Data Mining Tasks

• Prediction (supervised learning)
– Use some historical information to learn a model that can help to predict unknown or future values of some variable.
– Basis for forecasting
• Classification
• Regression
• Deviation detection
• Description (unsupervised)
– Find patterns that describe the data
– Clustering
– Association rule discovery
– Sequential pattern discovery

Classification

• Given a collection of records in which the class is known:
– Find a model able to describe the class given the values of the rest of the attributes.
• Measurements have to be used to validate the model and determine the accuracy of prediction
– Train and test
• Techniques
– Induction trees
• C4.5, ID3
• Very efficient in terms of execution time
• Very intuitive results
– Neural networks
• The result is a neural network: a black box
• Robust
• Not intuitive
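ID3 chooses the attribute to split on by information gain, which is one reason tree results are easy to read. The sketch below illustrates that criterion only. The record fields (`phase`, `weather`, `incident`) and their values are hypothetical and do not come from the aviation data sources discussed here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting `records` on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Hypothetical toy records, invented purely for illustration.
flights = [
    {"phase": "climb", "weather": "storm", "incident": "yes"},
    {"phase": "climb", "weather": "clear", "incident": "no"},
    {"phase": "arrival", "weather": "storm", "incident": "yes"},
    {"phase": "arrival", "weather": "clear", "incident": "no"},
]

# The attribute with the highest gain becomes the root test of the tree.
best = max(["phase", "weather"],
           key=lambda a: information_gain(flights, a, "incident"))
```

In this toy set `weather` separates the classes perfectly while `phase` carries no information, so `weather` is selected. A full tree learner would repeat the same choice recursively on each subset and validate the result on held-out test records.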

Clustering

• Given a set of records (unclassified), group records in such a way that:
– records in one cluster are more similar to one another.
– records in separate clusters are less similar to one another.
• Similarity measures have to be defined:
– Special attention to understanding the distance
• Approaches
– Partitioning algorithms: they first build different partitions and then these partitions are evaluated:
• K-means
– Hierarchical: they build a hierarchical decomposition
– Density based: density functions are used
– Kohonen networks [Kohonen ‘95]
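As a rough illustration of K-means, the sketch below runs the standard assign-then-update loop on one-dimensional data, using absolute difference as the distance. The function name `kmeans_1d`, the sample values and the starting centroids are assumptions made for this example. Real flight data would need a similarity measure chosen with care, as noted above.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd-style k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical measurements forming two obvious groups.
samples = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(samples, centroids=[0.0, 5.0])
```

Fixed starting centroids keep the run deterministic. In practice the result depends on initialisation, so several starts are usually compared.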

Association Rule Discovery

• Given a set of records described by a set of attributes:
– Find associations in the values of attributes
– Once associations are discovered, rules can be obtained
– Confidence vs support
– Apriori algorithm

Example association: At1=1 and At3=1 and At4=1

At1  At2  At3  At4  At5  At6  At7
 0    1    0    1    1    0    0
 0    0    0    0    1    0    0
 1    0    1    1    0    0    0
 1    0    1    1    0    1    1
 0    0    0    0    0    0    1
 0    1    0    1    1    0    1
 1    1    1    1    1    0    0
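Support and confidence for the example association can be computed directly. The sketch assumes the 49 values above are read row by row as seven records over At1 to At7. That reading is an assumption about the original slide layout, and the helper names `support` and `confidence` are illustrative.

```python
COLUMNS = ["At1", "At2", "At3", "At4", "At5", "At6", "At7"]

# Assumed row-by-row reading of the example table.
ROWS = [
    [0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 0],
]

def support(itemset):
    """Fraction of records in which every attribute of `itemset` equals 1."""
    idx = [COLUMNS.index(a) for a in itemset]
    hits = sum(all(row[i] == 1 for i in idx) for row in ROWS)
    return hits / len(ROWS)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the records."""
    return support(antecedent | consequent) / support(antecedent)

sup = support({"At1", "At3", "At4"})          # 3 of the 7 records
conf = confidence({"At4"}, {"At1", "At3"})    # 3 of the 5 records with At4=1
```

Under that reading the itemset {At1, At3, At4} holds in 3 of 7 records, and the rule At4 → At1, At3 holds in 3 of the 5 records where At4=1. Apriori avoids enumerating every itemset by extending only those whose support already meets a minimum threshold.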

Challenges of the algorithms

• Algorithms to find anomalies in large datasets have to be:
– fast
– scalable
– accurate
• Algorithms have to be able to deal with:
– continuous sequences, representing sensor data such as airspeed and altitude
– discrete sequences, such as sequences of pilot switch presses.

Data streams vs static data

Data streams

Challenges for algorithms:
- Processing data in a single pass.
- Generating models in an incremental way.
- Ability to detect model changes over time.
- Limited usage of memory and computing time.
- Possibility of automating the evaluation process.

A data stream:
- is potentially unbounded in size
- needs to be analyzed over time
- arrives at a very high rate
- and its underlying model evolves over time
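A small example of single-pass, incremental processing with bounded memory is Welford's running mean and variance, sketched below. It is not taken from the slides. The class name `RunningStats` and the sample values are assumptions chosen to illustrate the constraints listed above.

```python
class RunningStats:
    """Welford's online mean and variance: one pass, constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the values seen so far (0.0 for n < 2)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical sensor readings
    stats.update(reading)                 # each value is seen once, then dropped
```

Only three numbers are stored regardless of how long the stream runs, so no record has to be revisited.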

[Aggarwal et al.] “Data Streams: Models and Algorithms”. Advances in Database Systems, Springer, 2007
[Aguilar-Ruiz, Gama] “Data Streams”. Journal of Universal Computer Science, 2005
[Barbará] “Requirements for clustering data streams”. SIGKDD’02.

Goal

• New challenges introduced by evolving data, such as:
– resource-aware learning
– change detection
– novelty detection
– important application areas where data evolution must be taken into account
– how learning under constraints (time, storage capacity and other resources) is affected by data evolution
– how context can help the learning process

Change and concept drift

[Figure: four plots of the mean over time, illustrating sudden drift, gradual drift, incremental drift and reoccurring contexts]

[Joao Gama 2010]

Concept drift: the underlying concept may shift unexpectedly from time to time.
• Changes appear because of:
– Adversary actions
– Varying personal interest
– Changing population
– Complex environment

Required features

• Examples have to be processed as they arrive
• Each example should be processed:
– In small constant time
– With a fixed amount of main memory
– In a single scan of the data
– Without revisiting old records (or with reduced revisiting)
• Produce models equivalent to the one that would be obtained by a batch data-mining algorithm
• Detect and react to concept drift

[Joao Gama 2010]
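One common way to detect drift in constant time and memory per example is to monitor the learner's error rate, in the style of the Drift Detection Method (DDM). The slides do not state which detector is used, so the sketch below is an assumption. The class name, thresholds and synthetic error stream are all illustrative.

```python
from math import sqrt

class DriftDetector:
    """DDM-style detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0                  # running error rate
        self.p_min = float("inf")     # error rate at the best point so far
        self.s_min = float("inf")     # its standard deviation

    def update(self, error):
        """Consume one error flag; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()              # start fresh statistics for the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"

# Synthetic stream: 10% errors for 200 examples, then every prediction is wrong.
stream = [1 if i % 10 == 0 else 0 for i in range(200)] + [1] * 50
detector = DriftDetector()
signals = [detector.update(e) for e in stream]
```

On this stream the detector stays quiet during the first 200 examples, raises warnings shortly after the error rate jumps, and then signals drift. The warning phase is where a learner could start preparing a replacement model.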

Recurrent concepts

• Many learning algorithms deal with concept drift
– Based on: time windows, ensembles, drift detection.
– FLORA, SEA, DWM, DMM, ...
• What about recurrent concepts?
– A particular type of concept drift.
– With forgetting mechanisms, past data and models are discarded.
– However, it is common for concepts to reappear.

Context and data stream

Context
• Context representation:
• Context similarity:

numeric:

nominal:

Context integration
• We want to integrate context information with previously learned models.

• freqC is the most frequent Context in a sequence of context states {C1, C2, ... Cn}

• Concept history with associated context. h(Mk|Ci)

• Estimate that Mk represents the current underlying concept given the current context.
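The sketch below shows freqC as the most frequent state in a sequence of context states. It also shows one possible reading of the history h(Mk|Ci) as a relative frequency. That estimate is an assumption, since the slides do not give the formula, and the context tuples and model names are hypothetical.

```python
from collections import Counter

def most_frequent_context(states):
    """freqC: the context state that occurs most often in the sequence."""
    return Counter(states).most_common(1)[0][0]

def history_estimate(history, model, context):
    """Assumed estimate of h(model | context): the share of past periods
    with this context in which `model` was the model in use."""
    in_context = [m for c, m in history if c == context]
    return in_context.count(model) / len(in_context) if in_context else 0.0

# Hypothetical context states, encoded as (flight phase, weather).
recent = [("climb", "storm"), ("climb", "clear"), ("climb", "storm")]
freq_c = most_frequent_context(recent)

# Hypothetical concept history: (context, model in use) pairs.
history = [
    (("climb", "storm"), "M1"),
    (("climb", "storm"), "M1"),
    (("climb", "storm"), "M2"),
    (("arrival", "clear"), "M2"),
]
estimate = history_estimate(history, "M1", freq_c)
```

Here M1 was in use in two of the three past periods whose context matches freqC, so it would be the stronger candidate for the current concept.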

Model Storage

• Model storage for a model Mk:
– the period k where the model was used.
– using NB requires storing the CV
– the frequent context freqC for period k.
– accuracy of the model when it was in use.

• Represented as the tuple:

Model Retrieval
• Model retrieval for a model Mk:

– using a sample Sn of recent records,

– compute the MSE for Mk

– get the freqC for Sn

– use history h(Mk|freqC)

• The utility is defined from model accuracy (the higher the better) and from the similarity of the model's context to the current one (minimum distance).

• Retrieve the model with highest utility as:

CALDS: learning process

• Incrementally learn the underlying concept
• When a warning is signaled:
– Prepare a new base learner for the possible new concept
– Anticipate the drift
• When drift is detected:
– Store the current model
– Reuse a previously learned model when the underlying concept is recurrent.

CALDS: learning process

Improvements integrating context

Overall accuracy: 72.5%; 69.6%; 62.2%

SOME PREVIOUS EXPERIENCE

Other current applications
• ESA - European Space Agency
– Event Reporting Tool for non-manned satellite passes (Cryosat monitoring)

Current applications
• ESA - European Space Agency / Galileo Industries
– Galileo - Ground Control Segment Central Monitoring & Control Facility

Some current applications
• Portuguese Navy
– Singrar – Integrated System for Ship Repair and Resource allocation

The process


[Figure: process diagram with the labels Integrated Risk, Plans Activation / Maintenance, Drillings, Training, Application Input]

Space Weather

Why – Space Weather?

• To protect systems and people that might be at risk from space weather effects, we need to understand the causes of space weather.

Pedro Sousa

Space Weather Decision Support System

• SWDSS: third project financed by the European Space Agency (ESA) on Space Weather (SW)
• The main objective of SWDSS is to develop software capable of storing, manipulating and reacting to adverse Space Weather situations in spacecraft:
– Providing tools for analyzing the collected data;
– Supplying reporting facilities for systems management;
– Supplying a knowledge discovery tool for nowcast, forecast and data mining.

Data sources and providers

• Mission’s telemetry (payload and/or housekeeping) data and processed data

• Mission’s auxiliary data, e.g. orbital coordinates, apogee and perigee crossings, station coverage and hand-over, events, 3D models, metadata

• Data available from other sources, e.g. NOAA, SIDC, SWENET, National Agencies

• Data from ground-based measurements

Satellite Monitoring

Conclusion

• Huge amount of aviation data
1. Integrate data (micro and macro level)
2. Enrich data with semantics
3. Map data to techniques to discover patterns (static and streams):
   1. Anomalies
   2. Prediction
   3. Sequences
   4. Context influence

• Data mining in other similar domains has obtained results

• Next step: data mining for aviation safety

Ernestina Menasalvas Ruiz, Pedro Sousa

THANKS
