
Data Mining MTAT.03.183 (4AP = 6EAP)

Streams, time series,

Jaak Vilo

2009 Fall

Summary so far

• Data preparation

• Machine learning

• Statistics/significance

• Large data – algorithmics

• Visualisation

• Queries/reporting, OLAP

• Different types of data

• Business value

Jaak Vilo and other authors UT: Data Mining 2009

Streams, time series

• Time

• Sequence order and position

• Continuously arriving data

Wikipedia

• Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.


• In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.


Software

• RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time‐varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin (formerly: concept drift plugin))


• MOA (Massive Online Analysis): free open‐source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi‐directional interaction with Weka (machine learning).


Literature on Stream Mining

• http://www.csse.monash.edu.au/~mgaber/WResources.htm


Mining Data Streams

• What is stream data? Why Stream Data Systems?

• Stream data management systems: Issues and solutions

• Stream data cube and multidimensional OLAP analysis

• Stream frequent pattern analysis

• Stream classification

• Stream cluster analysis

• Research issues

November 25, 2009 Data Mining: Concepts and Techniques

Characteristics of Data Streams

• Data streams
  – Data streams: continuous, ordered, changing, fast, huge amount
  – Traditional DBMS: data stored in finite, persistent data sets

• Characteristics
  – Huge volumes of continuous data, possibly infinite
  – Fast changing and requires fast, real-time response
  – Data stream captures nicely our data processing needs of today
  – Random access is expensive: single scan algorithm (can only have one look)
  – Store only the summary of the data seen thus far
  – Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing

Stream Data Applications

• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even saved but random access is too expensive)

DBMS versus DSMS

DBMS                                         DSMS
Persistent relations                         Transient streams
One-time queries                             Continuous queries
Random access                                Sequential access
"Unbounded" disk store                       Bounded main memory
Only current state matters                   Historical data is important
No real-time services                        Real-time requirements
Relatively low update rate                   Possibly multi-GB arrival rate
Data at any granularity                      Data at fine granularity
Assume precise data                          Data stale/imprecise
Access plan determined by query              Unpredictable/variable data arrival
processor, physical DB design                and characteristics

Ack. From Motwani's PODS tutorial slides



Architecture: Stream Query Processing

[Figure: SDMS (Stream Data Management System). Multiple streams enter a Stream Query Processor, which uses Scratch Space (main memory and/or disk). The User/Application submits a Continuous Query and receives Results.]

Stream Data Mining Tasks

• On-line analysis of streams
• Clustering data streams
• Classification of data streams
• Mining frequent patterns in data streams
• Mining sequential patterns in data streams
• Mining partial periodicity in data streams
• Mining outliers and unusual patterns in data streams
• ……

Clustering on Streams

• K-means
  – not suitable for stream mining

• CluStream
  – assumes the shape of a cluster is always a circle

• DenStream
  – detects arbitrary shape clusters in stream data

Frequent Pattern Mining (FPM) in data streams

Frequent (/hot/top) patterns: items, item sets or sequences occurring frequently in a database.

Issues:
– Limited memory
– Reading past data is impossible

Question: How far is it justified to mine only frequent patterns in a data stream?

Infrequent pattern mining

Objective:
1. To find abnormal, surprising or "interesting" patterns in the data stream.
2. Mutual pattern mining.
3. Stream-specific item set mining.
4. Association rule mining among events of interest.

Application:
1. Text mining.
2. Distributed sensor networks.
3. Works well for evolving data streams.

Challenges in Stream Data Analysis

• Data volume is huge
• Need to remember recent and historical data
• Approaches to data reduction
• Need single linear scan algorithms
• Most existing algorithms and prototype systems are memory and CPU bound, and can only perform a single data mining function
• Desire to perform multiple analyses at the same time
• Occurrence of concept drifts where previous model is no longer valid
• Reduce the cost of learning where models need to be updated and replaced
• Require instant response

Loretta Auvil

Stream Data Reduction

• Challenges of "OLAP-ing" stream data
  – Raw data cannot be stored
  – Simple aggregates are not powerful enough
  – History shape and patterns at different levels are desirable

• MAIDS unique approach
  – A tilted time window to aggregate data at different points in time
  – A scalable multi-dimensional stream data cube that can aggregate a model of stream data efficiently without accessing the raw data

MAIDS Approach: Tilted Time Window

• Recent data is registered and weighted at a finer granularity than longer term data

• As the edge of a time window is reached, the finer granularity data is summarized and propagated to a coarser granularity

• Window is maintained automatically

[Figure: tilted time window on a time axis from Past to Present, with segments labelled 4 qtrs, 7 days, 24 hrs, 15 minutes and 30 sec]
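The propagation from finer to coarser granularity can be illustrated with a small Python sketch. It is not the MAIDS implementation: the class name `TiltedTimeWindow`, the `levels` argument and the choice of summing values as the summary are all assumptions made for illustration.

```python
class TiltedTimeWindow:
    """Hypothetical sketch: keeps aggregated values at several granularities.

    levels is a list of (name, capacity) pairs ordered from finest to
    coarsest. When a level is full, its slots are summed into a single
    slot of the next coarser level (summing is an assumed summary).
    """

    def __init__(self, levels):
        self.levels = levels
        self.slots = [[] for _ in levels]

    def add(self, value, level=0):
        if level == len(self.levels):
            return  # older than the coarsest level: the value is dropped
        slots = self.slots[level]
        if len(slots) == self.levels[level][1]:
            # The edge of this level is reached: summarise and propagate.
            self.add(sum(slots), level + 1)
            slots.clear()
        slots.append(value)


# Four quarter-hours make one hour, 24 hours make one day.
window = TiltedTimeWindow([("quarter", 4), ("hour", 24), ("day", 31)])
for _ in range(9):
    window.add(1)          # one event count per quarter-hour
print(window.slots)
```

After nine quarter-hour counts the finest level holds only the most recent value, while the eight older values have been folded into two hour-level slots, so memory grows with the number of levels rather than with the length of the stream.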

MAIDS: Stream Mining Architecture

MAIDS is aimed to:
• Discover changes, trends and evolution characteristics in data streams

• Construct clusters and classification models from data streams

• Explore frequent patterns and similarities among data streams

Features of MAIDS

• General purpose tool for data stream analysis
• Processes high-rate and multi-dimensional data
• Adopts a flexible tilted time window framework
• Facilitates multi-dimensional analysis using a stream cube architecture
• Integrates multiple data mining functions
• Provides user-friendly interface: automatic analysis and on-demand analysis
• Facilitates setting alarms for monitoring
• Built in D2K as D2K modules and leveraged in the D2K Streamline tool

Statistics Query Engine

• Answers user queries on data statistics, such as count, max, min, average, regression, etc.

• Uses tilted time window

• Uses an efficient data structure, H-tree, for partial computation of data cubes

Stream Data Classifier

• Builds models to make predictions

• Uses Naïve Bayesian Classifier with boosting

• Uses Tilted Time Window to track time related info

• Sets alarm to monitor events

Stream Pattern Finder

• Finds frequent patterns with multiple time granularities

• Keeps precise/compressed history in tilted time window

• Mines only the item sets of interest using the FP-tree algorithm

• Mines evolution and dramatic changes of frequent patterns

Stream Data Clustering

• Two stages: micro-clustering and macro-clustering

• Uses micro-clustering to do incremental, online processing and maintenance

• Uses tilted time frame

• Detects outliers when new clusters are formed

Demonstration


Significant Advances in the Areas of Data Management and Mining

• Tilted-time window for multi-resolution modeling
• Multi-dimensional analysis using a stream cube architecture
• Efficient "one-look" stream data mining algorithms:
  – classification, frequent pattern analysis, clustering, and information visualization
• Integration of "one-look" approaches into one stream data mining platform so they can cooperate to discover patterns and surprising events in real-time
• Internationally recognized research leadership in the areas of data management, mining, and knowledge sharing
• Experience in development of robust software framework supporting advanced data mining and information visualization
• Experience in development of software environments supporting problem solving and evidence-based decision making

Knowledge Extraction from Streaming Text

Information extraction
• process of using advanced automated machine learning approaches
• to identify entities in text documents
• extract this information along with the relationships these entities may have in the text documents

This project demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.

Challenges of Stream Data Processing

• Multiple, continuous, rapid, time-varying, ordered streams

• Main memory computations

• Queries are often continuous
  – Evaluated continuously as stream data arrives
  – Answer updated over time

• Queries are often complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing
  – Beyond relational queries (scientific, data mining, OLAP)

• Multi-level/multi-dimensional processing and data mining
  – Most stream data are at low-level or multi-dimensional in nature

Processing Stream Queries

• Query types
  – One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
  – Predefined query vs. ad-hoc query (issued on-line)

• Unbounded memory requirements
  – For real-time response, main memory algorithm should be used
  – Memory requirement is unbounded if one will join future tuples

• Approximate query answering
  – With bounded memory, it is not always possible to produce exact answers
  – High-quality approximate answers are desired
  – Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing

• Major challenges
  – Keep track of a large universe, e.g., pairs of IP addresses, not ages

• Methodology
  – Synopses (trade-off between accuracy and storage)
  – Use synopsis data structure, much smaller (O(log^k N) space) than their base data set (O(N) space)
  – Compute an approximate answer within a small error range (factor ε of the actual answer)

• Major methods
  – Random sampling
  – Histograms
  – Sliding windows
  – Multi-resolution model
  – Sketches
  – Randomized algorithms

Stream Data Processing Methods (1)

• Random sampling (but without knowing the total length in advance)
  – Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.

• Sliding windows
  – Make decisions based only on recent data of sliding window size w
  – An element arriving at time t expires at time t + w

• Histograms
  – Approximate the frequency distribution of element values in a stream
  – Partition data into a set of contiguous buckets
  – Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)

• Multi-resolution models
  – Popular models: balanced binary trees, micro-clusters, and wavelets
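Reservoir sampling as described above can be sketched in a few lines of Python. This is one common formulation (often called Algorithm R), not necessarily the variant the slide had in mind; the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of s elements from a stream of
    unknown length, reading each element exactly once."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform integer in [0, n)
            if j < s:                    # happens with probability s/n
                reservoir[j] = item      # replace a random old element
    return reservoir


sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(sample)
```

Only the s reservoir slots and a counter are kept in memory, which is why the method suits streams whose total length N is not known in advance.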

Stream Data Processing Methods (2)

• Sketches
  – Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
  – Frequency moments of a stream A = {a1, …, aN}, Fk:

    \[ F_k = \sum_{i=1}^{v} m_i^k \]

    where v is the universe or domain size and m_i is the frequency of i in the sequence
  – Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space

• Randomized algorithms
  – Monte Carlo algorithm: bound on running time but may not return correct result
  – Chebyshev's inequality: let X be a random variable with mean μ and standard deviation σ

    \[ P(|X - \mu| > k) \le \frac{\sigma^2}{k^2} \]

  – Chernoff bound: let X be the sum of independent Poisson trials X1, …, Xn with mean μ, and δ in (0, 1]

    \[ P[X > (1 + \delta)\mu] < e^{-\mu\delta^2/4} \]

  – The probability decreases exponentially as we move from the mean
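To make the definition of the frequency moments concrete, the following Python sketch computes F_k exactly. It is not a sketch-based approximation: it stores one counter per distinct value, i.e. O(v) space, which is exactly the cost that sketches are meant to avoid. The function name `frequency_moment` is an illustrative choice.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum over values i of m_i ** k,
    where m_i is the number of occurrences of value i in the stream."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


stream = "abacabad"                    # a occurs 4x, b 2x, c 1x, d 1x
print(frequency_moment(stream, 0))     # F0: number of distinct values
print(frequency_moment(stream, 1))     # F1: length of the stream
print(frequency_moment(stream, 2))     # F2: sum of squared frequencies
```

For this toy stream F0 = 4, F1 = 8 and F2 = 16 + 4 + 1 + 1 = 22.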

Approximate Query Answering in Streams

• Sliding windows
  – Only over sliding windows of recent stream data
  – Approximation but often more desirable in applications

• Batched processing, sampling and synopses
  – Batched if update is fast but computing is slow
    · Compute periodically, not very timely
  – Sampling if update is slow but computing is fast
    · Compute using sample data, but not good for joins, etc.
  – Synopsis data structures
    · Maintain a small synopsis or sketch of data
    · Good for querying historical data

• Blocking operators, e.g., sorting, avg, min, etc.
  – Blocking if unable to produce the first output until seeing the entire input
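A sliding window turns a blocking operator such as avg into one that can answer at any time. The Python sketch below maintains a running mean over a time-based window, using the expiry rule stated earlier (an element arriving at time t expires at time t + w). The class name `SlidingWindowMean` and its methods are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the elements that arrived in the last w time units."""

    def __init__(self, w):
        self.w = w
        self.items = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, t, value):
        self.items.append((t, value))
        self.total += value
        # Drop every element whose expiry time t_i + w has been reached.
        while self.items and self.items[0][0] + self.w <= t:
            _, old = self.items.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.items) if self.items else None


win = SlidingWindowMean(w=3)
for t, v in [(0, 10), (1, 20), (2, 30), (3, 40)]:
    win.add(t, v)
    print(t, win.mean())
```

The answer is exact for the window but only an approximation of the mean over the whole stream, which is the trade-off named in the first bullet.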

Projects on DSMS (Data Stream Management System)

Research projects and system prototypes

• STREAM (Stanford): A general-purpose DSMS

• Cougar (Cornell): sensors

• Aurora (Brown/MIT): sensor monitoring, dataflow

• Hancock (AT&T): telecom streams

• Niagara (OGI/Wisconsin): Internet XML databases

• OpenCQ (Georgia Tech): triggers, incr. view maintenance

• Tapestry (Xerox): pub/sub content-based filtering

• Telegraph (Berkeley): adaptive engine for sensors

• Tradebot (www.tradebot.com): stock tickers & streams

• Tribeca (Bellcore): network monitoring

• MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams

Stream Data Mining vs. Stream Querying

• Stream mining: a more challenging task in many cases
  – It shares most of the difficulties with stream querying
    · But often requires less "precision", e.g., no join, grouping, sorting
  – Patterns are hidden and more general than querying
  – It may require exploratory analysis
    · Not necessarily continuous queries

• Stream data mining tasks
  – Multi-dimensional on-line analysis of streams
  – Mining outliers and unusual patterns in stream data
  – Clustering data streams
  – Classification of stream data

Concept drift

• In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.


Episode Rules

• Association rules applied to sequences of events.

• Episode – set of event predicates and partial ordering on them

© Prentice Hall

Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001

Basics

• Association rules describe how things occur together in the data
  – E.g., "IF an alarm has certain properties, THEN it will have other given properties"

• Episode rules describe temporal relationships between things
  – E.g., "IF a certain combination of alarms occurs within a time period, THEN another combination of alarms will occur within a time period"

Course on Data Mining


Basics

[Figure: alarms flow to the Network Management System from the Switched Network (MSC) and the Access Network (BSC, BTS). Legend: MSC = Mobile station controller, BSC = Base station controller, BTS = Base station transceiver.]

Basics

• As defined earlier, telecom data contains alarms:

  1234 EL1 PCM 940926082623 A1 ALARMTEXT..

  – 1234: alarm number
  – EL1: alarming network element
  – PCM: alarm type
  – 940926082623: date, time
  – A1: alarm severity class

• Now we forget about relationships between attributes within alarms as with the association rules

• We just take the alarm number attribute, handle it here as event/alarm type and inspect the relationships between events/alarms


Episodes

• Partially ordered set of pages

• Serial episode – totally ordered with time constraint

• Parallel episode – partially ordered with time constraint

• General episode – partially ordered with no time constraint

DAG for Episode

[Figure: directed acyclic graph representation of an episode]


Basics

• Data:
  – Data is defined over a set R of event types
  – Every event is a pair (A, t), where
    · A ∈ R is the event type (e.g., alarm type)
    · t is an integer, the occurrence time of the event
  – Event sequence S on R is a triple (s, Ts, Te)
    · Ts is starting time and Te is ending time
    · Ts < Te are integers
    · s = (A1, t1), (A2, t2), …, (An, tn)
    · Ai ∈ R and Ts ≤ ti < Te for all i = 1, …, n



Basics

• Example alarm data sequence:

  time   10  20  30  40  50  60  70  80  90  100  110  120  130  140  150
  event  D   C   A   B   D   A   B   C   A   D    C    A    B    D    A

• Here:
  – A, B, C and D are event (or here alarm) types
  – 10…150 are occurrence times
  – s = (D, 10), (C, 20), …, (A, 150)
  – Ts (starting time) = 10 and Te (ending time) = 150

• Note: There need not be events on every time slot!



Basics

• Episodes:
  – An episode is a pair α = (V, ≤)
    · V is a collection of event types, e.g., alarm types
    · ≤ is a partial order on V
  – Given a sequence S of alarms, an episode α = (V, ≤) occurs within S if there is a way of satisfying the event types (e.g., alarm types) in V using the alarms of S so that the partial order ≤ is respected
  – Intuitively: episodes consist of alarms that have certain properties and occur in a certain partial order


Basics

• The most useful partial orders are:
  – Total orders
    · The predicates of each episode have a fixed order
    · Such episodes are called serial (or "ordered")
  – Trivial partial orders
    · The order of predicates is not considered
    · Such episodes are called parallel (or "unordered")

• Complicated?
  – Not really, let's take some clarifying examples



Basics

• Examples:

[Figure: three example episodes over the event types A, B and C: a serial episode, a parallel episode, and a more complex episode with serial and parallel parts]


WINEPI Approach

• The name of the WINEPI method comes from the technique it uses: a sliding window

• Intuitively:
  – A window is slid through the event-based data sequence
  – Each window "snapshot" is like a row in a database
  – The collection of these "snapshots" forms the rows in the database

• Complicated?
  – Not really, let's take a clarifying example





[Figure: event sequence on a time axis from 0 to 90 with the events D, C, A, B, D, A, B, C]

• The window width is 40 seconds, last point excluded
• The first/last window contains only the first/last event



• Formally, given a set E of event types, an event sequence S = (s, Ts, Te) is an ordered sequence of events event_i such that event_i ≤ event_{i+1} for all i = 1, …, n−1, and Ts ≤ event_i < Te for all i = 1, …, n

[Figure: events event_1, event_2, event_3, …, event_n at times t1, t2, t3, …, tn on a time axis between Ts and Te]




• Formally, a window on event sequence S is an event sequence S_w = (w, ts, te), where ts < Te, te > Ts, and w consists of those pairs (event, t) from s where ts ≤ t < te

• The value te − ts is called window width, W

[Figure: a window of width W from ts to te placed over the time axis Ts … Te with event times t1, t2, t3, …, tn]




• By definition, the first and the last windows on a sequence extend outside the sequence, so that the first window contains only the first time point of the sequence, and the last window only the last time point

[Figure: the first and the last window of width W, each extending beyond the interval Ts … Te of the sequence with event times t1, t2, t3, …, tn]




• The frequency (cf. support with association rules) of an episode α is the fraction of windows in which the episode occurs, i.e.,

  \[ fr(\alpha, S, W) = \frac{|\{ S_w \in W(S, W) \mid \alpha \text{ occurs in } S_w \}|}{|W(S, W)|} \]

  where W(S, W) is the set of all windows S_w of sequence S such that the window width is W
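The frequency definition can be checked on the small example sequence used in these slides (events D, C, A, B, D, A, B, C at times 10 to 80, window width 40, step 10). The Python sketch below handles only parallel episodes, treated as sets of event types; the function names and the representation of an episode as a string of type letters are assumptions made for illustration.

```python
EVENTS = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
          ("D", 50), ("A", 60), ("B", 70), ("C", 80)]


def windows(events, width, step):
    """All half-open windows [start, start + width) on the sequence. The
    first one holds only the first time point, the last only the last."""
    first = min(t for _, t in events)
    last = max(t for _, t in events)
    result = []
    start = first - width + step
    while start <= last:
        result.append((start, start + width))
        start += step
    return result


def frequency(episode, events, width, step):
    """Return (windows containing the parallel episode, number of windows)."""
    wins = windows(events, width, step)
    needed = set(episode)
    hits = sum(
        needed <= {e for e, t in events if ws <= t < we}
        for ws, we in wins
    )
    return hits, len(wins)


def confidence(lhs, whole, events, width, step):
    """Confidence of an episode rule: fr(whole episode) / fr(LHS episode)."""
    return (frequency(whole, events, width, step)[0]
            / frequency(lhs, events, width, step)[0])


print(frequency("D", EVENTS, 40, 10))          # windows containing D
print(frequency("DA", EVENTS, 40, 10))         # windows containing D and A
print(confidence("D", "DA", EVENTS, 40, 10))   # rule D => A
```

On this sequence D occurs in 8 of the 11 windows (73%) and D together with A in 6 of 11 (55%), so the rule with D on the left-hand side and A added on the right has confidence 6/8 = 75%.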




• When searching for the episodes, a frequency threshold (cf. support threshold with association rules) min_fr is used

• Episode α is frequent if fr(α, s, win) ≥ min_fr, i.e., "if the frequency of α reaches the minimum frequency threshold within the data sequence s and with window width win"

• F(s, win, min_fr): the collection of frequent episodes in s with respect to win and min_fr

• Apriori trick holds: if an episode α is frequent in an event sequence s, then all subepisodes β ≼ α are frequent




• Formally, an episode rule is an expression β ⇒ γ, where β and γ are episodes such that β is a subepisode of γ

• An episode β is a subepisode of γ (β ≼ γ), if the graph representation of β is a subgraph of the representation of γ

[Figure: graph representations of two example episodes β and γ over the event types A, B and C]



• The fraction

  \[ \frac{fr(\gamma, S, W)}{fr(\beta, S, W)} = \frac{\text{frequency of the whole episode}}{\text{frequency of the LHS episode}} \]

  is the confidence of the WINEPI episode rule

• The confidence can be interpreted as the conditional probability of the whole of γ occurring in a window, given that β occurs in it




• Intuitively:
  – WINEPI rules are like association rules, but with an additional time aspect: if events (alarms) satisfying the rule antecedent (left-hand side) occur in the right order within W time units, then also the rule consequent (right-hand side) occurs in the location described by ≤, also within W time units

  antecedent ⇒ consequent [window width] (f, c)



WINEPI Algorithm

• Input: A set R of event/alarm types, an event sequence s over R, a set E of episodes, a window width win, and a frequency threshold min_fr
• Output: The collection F(s, win, min_fr)
• Method:
  1. compute C_1 := {α ∈ E | |α| = 1};
  2. i := 1;
  3. while C_i ≠ ∅ do
  4. (*)  compute F_i(s, win, min_fr) := {α ∈ C_i | fr(α, s, win) ≥ min_fr};
  5.      i := i + 1;
  6. (**) compute C_i := {α ∈ E | |α| = i, and β ∈ F_|β|(s, win, min_fr) for all β ∈ E, β ≺ α};

  (*) = database pass, (**) = candidate generation
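The level-wise method can be sketched in Python for the simplest case. The sketch assumes parallel episodes only, represented as sets of distinct event types, and recomputes every window from scratch instead of updating counts incrementally; function names and the example data (the eight-event sequence D, C, A, B, D, A, B, C at times 10 to 80) are illustrative.

```python
from itertools import combinations

EVENTS = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
          ("D", 50), ("A", 60), ("B", 70), ("C", 80)]


def window_contents(events, width, step):
    """Set of event types seen in each window [start, start + width)."""
    first = min(t for _, t in events)
    last = max(t for _, t in events)
    contents = []
    start = first - width + step
    while start <= last:
        contents.append(frozenset(e for e, t in events
                                  if start <= t < start + width))
        start += step
    return contents


def frequent_parallel_episodes(events, width, step, min_fr):
    """Level-wise search: database pass, then candidate generation."""
    wins = window_contents(events, width, step)
    types = sorted({e for e, _ in events})
    candidates = [frozenset([a]) for a in types]      # C_1
    frequent = {}
    size = 1
    while candidates:
        # (*) database pass: keep the candidates that reach min_fr
        level = {}
        for cand in candidates:
            fr = sum(cand <= w for w in wins) / len(wins)
            if fr >= min_fr:
                level[cand] = fr
        frequent.update(level)
        # (**) candidate generation: every subepisode that is one event
        # type smaller must be frequent
        size += 1
        items = sorted({a for ep in level for a in ep})
        candidates = [
            frozenset(c) for c in combinations(items, size)
            if all(frozenset(sub) in level
                   for sub in combinations(c, size - 1))
        ]
    return frequent


result = frequent_parallel_episodes(EVENTS, width=40, step=10, min_fr=0.40)
for episode, fr in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(episode)), round(fr, 2))
```

With a 40% threshold (at least 5 of the 11 windows) all singletons and all pairs are frequent, and ABD is the only frequent episode of size three, so no candidate of size four is generated.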


WINEPI Algorithm

• First problem: given a sequence and an episode, find out whether the episode occurs in the sequence
• Finding the number of windows containing an occurrence of the episode can be reduced to this
• Successive windows have a lot in common
• How to use this?
  – An incremental algorithm
  – Same idea as for association rules
  – A candidate episode has to be a combination of two episodes of smaller size
  – Parallel episodes, serial episodes



WINEPI Algorithm

• Parallel episodes:
  – For each candidate α maintain a counter α.event_count: how many events of α are present in the window
  – When α.event_count becomes equal to |α|, indicating that α is entirely included in the window, save the starting time of the window in α.inwindow
  – When α.event_count decreases again, increase the field α.freq_count by the number of windows where α remained entirely in the window

• Serial episodes: use state automata





[Figure: event sequence on a time axis from 0 to 90 with the events D, C, A, B, D, A, B, C]

• The window width is 40 secs, movement step 10 secs
• The length of the sequence is 70 secs (10-80)



• By sliding the window, we'll get 11 windows (U1-U11):

[Figure: windows U1, U2, …, U11 sliding over the event sequence D, C, A, B, D, A, B, C on a time axis from 0 to 90]

• Frequency threshold is set to 40%, i.e., an episode has to occur at least in 5 of the 11 windows







• Suppose that the task is to find all parallel episodes:
  – First, create singletons, i.e., parallel episodes of size 1 (A, B, C, D)
  – Then, recognize the frequent singletons (here all are)
  – From those frequent episodes, build candidate episodes of size 2: AB, AC, AD, BC, BD, CD
  – Then, recognize the frequent parallel episodes (here all are)
  – From those frequent episodes, build candidate episodes of size 3: ABC, ABD, ACD, BCD
  – When recognizing the frequent episodes, only ABD occurs in more than four windows
  – There are no candidate episodes of size four




• Episode frequencies and example rules with WINEPI:

  Episode   Frequency
  D         73%
  C         73%
  A         64%
  B         64%
  D C       45%
  D A       55%
  D B       45%
  C A       45%
  C B       45%
  A B       45%
  D A B     45%

  Example rules:
  D ⇒ A [40] (55%, 75%)
  D A ⇒ B [40] (45%, 82%)



WINEPI: Experimental Results

• Data:
  – Alarms from a telecommunication network
  – 73 000 events (7 weeks), 287 event types
  – Parallel and serial episodes
  – Window widths (W) 10-120 seconds
  – Window movement = W/10
  – min_fr = 0.003 (0.3%), frequent: about 100 occurrences
  – 90 MHz Pentium, 32MB memory, Linux operating system. The data resided in a 3.0 MB flat text file



WINEPI: Experimental Results

Window      Serial episodes        Parallel episodes
width (s)   #frequent   time (s)   #frequent   time (s)
10          16          31         10          8
20          31          63         17          9
40          57          117        33          14
60          87          186        56          15
80          145         271        95          21
100         245         372        139         21
120         359         478        189         22




• One shortcoming in WINEPI approach:
– Consider that two alarms of type A and one alarm of type B occur in a window
– Does the parallel episode consisting of A and B appear once or twice?
– If once, then with which alarm of type A?

[Figure: event sequence D C A B D A B C on a time axis from 0 to 90 secs]



MINEPI Approach

• Alternative approach to discovery of episodes
– No sliding windows
– For each potentially interesting episode, find out the exact occurrences of the episode

• Advantages: easy to modify time limits, several time limits for one rule:

"If A and B occur within 15 seconds, then C follows within 30 seconds"

• Disadvantages: uses a lot of space




• Formally, given an episode α and an event sequence S, the interval [ts,te] is a minimal occurrence of α in S,
– If α occurs in the window corresponding to the interval
– If α does not occur in any proper subinterval

• The set of minimal occurrences of an episode α in a given event sequence is denoted by mo(α):

mo(α) = { [ts,te] | [ts,te] is a minimal occurrence of α }




• Example: the parallel episode consisting of event types A and B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}; the second episode shown in the figure has one occurrence in s: {[60,80]}

[Figure: the two example episodes, one built from A and B and one from A, B and C, above the event sequence D C A B D A B C on a time axis from 0 to 90 secs]
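A brute-force check of the minimal-occurrence definition can be sketched as follows. It is a minimal sketch, assuming the event times read off the time axis (D at 10, C at 20, and so on) and handling only parallel episodes with distinct event types; `minimal_occurrences` is a hypothetical helper name, not part of MINEPI itself, which finds occurrences far more efficiently.

```python
# Assumed event times, read off the slide's time axis.
EVENTS = [(10, "D"), (20, "C"), (30, "A"), (40, "B"),
          (50, "D"), (60, "A"), (70, "B"), (80, "C")]

def minimal_occurrences(events, episode):
    """Return the minimal occurrences [ts, te] of a parallel episode."""
    need = set(episode)
    times = [t for t, _ in events]
    # Every closed interval between two event times that contains all needed types.
    occ = [(ts, te)
           for i, ts in enumerate(times) for te in times[i:]
           if need <= {e for t, e in events if ts <= t <= te}]

    def proper_sub(a, b):
        """True if interval a is a proper subinterval of interval b."""
        return a != b and b[0] <= a[0] and a[1] <= b[1]

    # Keep only occurrences that contain no smaller occurrence.
    return sorted(o for o in occ if not any(proper_sub(p, o) for p in occ))

print(minimal_occurrences(EVENTS, "AB"))   # [(30, 40), (40, 60), (60, 70)]
```

The printed intervals agree with the three minimal occurrences of the parallel episode {A, B} listed in the example.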




• Informally, a MINEPI episode rule gives the conditional probability that a certain combination of events (alarms) occurs within some time bound, given that another combination of events (alarms) has occurred within a time bound

• Formally, an episode rule is β[win1] ⇒ α[win2]
• β and α are episodes such that β ⊑ α (β is a subepisode of α)
• If episode β has a minimal occurrence at interval [ts,te] with te - ts ≤ win1, then episode α occurs at interval [ts,t'e] for some t'e such that t'e - ts ≤ win2


Pattern Discovery

1. Choose the language (formalism) to represent the patterns (search space)

2. Choose the rating for patterns, to tell which is “better” than others

3. Design an algorithm that finds the best patterns from the pattern class, fast.

Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.

[Figure: levels of DNA structure, from Level 0 (the sequence ATCGCTGAATTCCAATGTG) to Level 6]

Eukaryotic genome can be thought of as six levels of DNA structure.

The loops at Level 4 range from 0.5kb to 100kb in length.

If these loops were stabilized then the genes inside the loop would not be expressed.

DNA determines function (?)

DNA: GenBank / EMBL Bank

Protein: SwissProt/TrEMBL

Structure: PDB/Molecular Structure Database

4 Nucleotides, 20+ Amino Acids (3nt → 1 AA)

Function?

A Simple Gene

[Figure: a simple gene with parts A, B and C, its upstream/promoter and downstream regions marked]

DNA: ATCGAAATTAGCTTTA

+ Modifications

Species and individuals

• Animals, plants, fungi, bacteria, …

• Species

• Individuals

www.tolweb.org

Gene Regulatory Signal Finding

Transcription Factor

Transcription Factor Binding Site

Goal: Detect Transcription Factor Binding Sites.

Eleazar Eskin: Columbia Univ.

TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCCTTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTCTTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAATGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTCTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTGTGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACTTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTAGATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGCTTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCGAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTCTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC

Patterns: AT

Patterns: [AT][ACT]AT

IUPAC: W H AT
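To make the link between the IUPAC notation and character groups concrete, here is a small sketch that translates an IUPAC nucleotide pattern into a regular expression and lists its match positions. The function names and the short test sequence are illustrative assumptions; only the IUPAC code table is standard.

```python
import re

# Standard IUPAC nucleotide ambiguity codes written as character groups.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'WHAT' into a regular expression."""
    return "".join(IUPAC[c] for c in pattern.upper())

def match_positions(pattern, sequence):
    """0-based start positions of all matches, overlapping ones included."""
    rx = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in rx.finditer(sequence)]

print(iupac_to_regex("WHAT"))               # [AT][ACT]AT
print(match_positions("WHAT", "AAATTTAT"))  # [0, 4]
```

The lookahead wrapper is a design choice so that overlapping occurrences are all reported, which matters when counting pattern hits in upstream regions.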

Cluster of co‐expressed genes, pattern discovery in regulatory regions

600 basepairs

Expression profiles

Retrieve

Upstream regions

Find patterns over-represented within cluster

Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000

Binomial or hypergeometric distribution

Background - ALL upstream sequences: 5 out of 25, p = 0.2

Cluster: occurs 3 times

P(3,6,0.2) is probability of having 3 matches in 6 sequences

P(≥3,6,0.2) = 0.0989
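The binomial score on this slide can be reproduced with a few lines of Python. This is a sketch of the binomial variant only (the hypergeometric alternative is not shown), and the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k matching sequences out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """Probability of k or more matching sequences out of n."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Background: the pattern matches 5 of 25 upstream sequences, so p = 0.2.
# Cluster: the pattern matches 3 of 6 sequences.
print(round(binom_pmf(3, 6, 0.2), 4))    # 0.0819
print(round(binom_tail(3, 6, 0.2), 4))   # 0.0989
```

The tail probability, not the point probability, is the value 0.0989 quoted above: it answers how likely it is to see at least this many matches in the cluster by chance.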

Pattern vs cluster “strength”

The pattern probability vs. the average silhouette for the cluster

The same for randomised clusters

Vilo et al. ISMB 2000

The most improbable patterns from the best clusters

Pattern               Probability   Cluster   Occurrences   Total nr of   K in
                                    size      in cluster    occurrences   K-means
AAAATTTT              2.59E-43      96        72            830           60
ACGCG                 6.41E-39      96        75            1088          50
ACGCGT                5.23E-38      94        52            387           40
CCTCGACTAA            5.43E-38      27        18            23            220
GACGCG                7.89E-31      86        40            284           38
TTTCGAAACTTACAAAAAT   2.08E-29      26        14            18            450
TTCTTGTCAAAAAGC       2.08E-29      26        14            18            325
ACATACTATTGTTAAT      3.81E-28      22        13            18            280
GATGAGATG             5.60E-28      68        24            83            84
TGTTTATATTGATGGA      1.90E-27      24        13            18            220
GATGGATTTCTTGTCAAAA   5.04E-27      18        12            18            500
TATAAATAGAGC          1.51E-26      27        13            18            300
GATTTCTTGTCAAA        3.40E-26      20        12            18            700
GATGGATTTCTTG         3.40E-26      20        12            18            875
GGTGGCAA              4.18E-26      40        20            96            180
TTCTTGTCAAAAAGCA      5.10E-26      29        13            18            250
CGAAACTTACAAA         5.10E-26      29        13            18            290
GAAACTTACAAAAATAAA    7.92E-26      21        12            18            650
TTTGTTTATATTG         1.74E-25      22        12            18            600
ATCAACATACTATTGT      3.62E-25      23        12            18            375
ATCAACATACTATTGTTA    3.62E-25      23        12            18            625
GAACGCGCG             4.47E-25      20        11            13            260
GTTAATTTCGAAAC        7.23E-25      24        12            18            400
GGTGGCAAAA            3.37E-24      33        14            31            475
ATCTTTTGTTTATATTGA    7.19E-24      19        11            18            675
TTTGTTTATATTGATGGA    7.19E-24      19        11            18            475
GTGGCAAA              1.14E-23      28        18            137           725

Vilo et al. ISMB 2000

GGTGGCAA - proteasome associated control element

YOR261C  YOR261C  RPN8   protein degradation             26S proteasome regulatory subunit      S0005787  1
YDL020C  YDL020C  RPN4   protein degradation, ubiquitin  26S proteasome subunit                 S0002178  1
YDL007W  YDL007W  RPT2   protein degradation             26S proteasome subunit                 S0002165  1
YDL147W  YDL147W  RPN5   protein degradation             26S proteasome subunit                 S0002306  1
YOL038W  YOL038W  PRE6   protein degradation             20S proteasome subunit (alpha4)        S0005398  1
YKL145W  YKL145W  RPT1   protein degradation, ubiquitin  26S proteasome subunit                 S0001628  1
YDL097C  YDL097C  RPN6   protein degradation             26S proteasome regulatory subunit      S0002255  1
YDR394W  YDR394W  RPT3   protein degradation             26S proteasome subunit                 S0002802  1
YBR173C  YBR173C  UMP1   protein degradation, ubiquitin  20S proteasome maturation factor       S0000377  1
YER012W  YER012W  PRE1   protein degradation             20S proteasome subunit C11(beta4)      S0000814  1
YPR108W  YPR108W  RPN7   protein degradation             26S proteasome regulatory subunit      S0006312  1
YOR117W  YOR117W  RPT5   protein degradation             26S proteasome regulatory subunit      S0005643  1
YJL001W  YJL001W  PRE3   protein degradation             20S proteasome subunit (beta1)         S0003538  1
YPR103W  YPR103W  PRE2   protein degradation             20S proteasome subunit (beta5)         S0006307  1
YOR157C  YOR157C  PUP1   protein degradation             20S proteasome subunit (beta2)         S0005683  1
YGL048C  YGL048C  RPT6   protein degradation             26S proteasome regulatory subunit      S0003016  1
YHR200W  YHR200W  RPN10  protein degradation             26S proteasome subunit                 S0001243  1
YML092C  YML092C  PRE8   protein degradation             20S proteasome subunit Y7 (alpha2)     S0004557  1
YIL075C  YIL075C  RPN2   tRNA processing                 26S proteasome subunit                 S0001337  1
YMR314W  YMR314W  PRE5   protein degradation             20S proteasome subunit(alpha6)         S0004931  1
YGR253C  YGR253C  PUP2   protein degradation             20S proteasome subunit(alpha5)         S0003485  1
YGR135W  YGR135W  PRE9   protein degradation             20S proteasome subunit Y13 (alpha3)    S0003367  1
YFR004W  YFR004W  RPN11  transcription                   putative global regulator              S0001900  1
YOR259C  YOR259C  RPT4   protein degradation             26S proteasome regulatory subunit      S0005785  1
YFR052W  YFR052W  RPN12  protein degradation             26S proteasome regulatory subunit      S0001948  1
YFR050C  YFR050C  PRE4   protein degradation             proteasome subunit, B type             S0001946  1
YGL011C  YGL011C  SCL1   protein degradation             20S proteasome subunit YC7ALPHA/Y8     S0002979  1
YDR427W  YDR427W  RPN9   protein degradation             26S proteasome regulatory subunit      S0002835  1
YOR362C  YOR362C  PRE10  protein degradation             20S proteasome subunit C1 (alpha7)     S0005889  1
YBL041W  YBL041W  PRE7   protein degradation             20S proteasome subunit                 S0000137  1
YER021W  YER021W  RPN3   protein degradation             26S proteasome regulatory subunit      S0000823  1
YER094C  YER094C  PUP3   protein degradation             20S proteasome subunit (beta3)         S0000896  1
YGR270W  YGR270W  YTA7   protein degradation             26S proteasome subunit; ATPase         S0003502  1
YHR027C  YHR027C  RPN1   protein degradation             26S proteasome regulatory subunit      S0001069  1
YER047C  YER047C  SAP1   mating type switching           AAA family protein                     S0000849  1
YGR232W  YGR232W         unknown                         unknown                                S0003464  1

>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)

TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGC

101 Sequences relative to ORF startYGR128C + 100GTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_

...>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_

GATGAG.T     1:52/70 2:453/508 R:7.52345 BP:1.02391e-33
G.GATGAG.T   1:39/49 2:193/222 R:13.244  BP:2.49026e-33
AAAATTTT     1:63/77 2:833/911 R:4.95687 BP:5.02807e-32
TGAAAA.TTT   1:45/53 2:333/350 R:8.85687 BP:1.69905e-31
TG.AAA.TTT   1:53/61 2:538/570 R:6.45662 BP:3.24836e-31
TG.AAA.TTTT  1:40/43 2:254/260 R:10.3214 BP:3.84624e-30
TGAAA..TTT   1:54/65 2:608/645 R:5.82106 BP:1.0887e-29
...

GATGAG.T
TGAAA..TTT

Sequence patterns: the basis of the SPEXS

A G A AT C GC C C

GCAT (4 positions)

GCATA (3 positions)

GCATA.C

SPEXS ‐ substrings

• enqueue( Q, Empty pattern (occurs everywhere) )

• while P = dequeue( Q )
– check all positions of P.pos
– For every character c in P.pos
• create pattern Pc
• advance all positions in Pc.pos by 1
• enqueue( Q, Pc )
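The queue-driven loop above can be sketched in Python as follows. This is a minimal illustration, not the SPEXS implementation: the `min_count` cut-off and the name `spexs_substrings` are assumptions, and each pattern stores the 1-based position of its last character, so the next character to check is found directly at that index in a 0-based Python string.

```python
from collections import defaultdict, deque

def spexs_substrings(text, min_count=2):
    """Breadth-first enumeration of substrings occurring at least min_count times.

    Returns {pattern: [1-based end position of every occurrence]}.
    """
    found = {}
    # The empty pattern "ends" at 0..n-1, i.e. it occurs in front of every character.
    queue = deque([("", list(range(len(text))))])
    while queue:
        pattern, ends = queue.popleft()
        extensions = defaultdict(list)
        for e in ends:
            if e < len(text):                       # a next character exists
                extensions[text[e]].append(e + 1)   # advance the position by 1
        for c, pos in sorted(extensions.items()):
            if len(pos) >= min_count:               # keep frequent extensions only
                found[pattern + c] = pos
                queue.append((pattern + c, pos))
    return found

patterns = spexs_substrings("abracadabradadabraca")
print(patterns["a"])    # [1, 4, 6, 8, 11, 13, 15, 18, 20]
print(patterns["ab"])   # [2, 9, 16]
```

Because every position list is derived from the parent's list, each extension step only touches the occurrences of the pattern being extended, which is the "count and memorize" idea of the following slides.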


SPEXS: count and memorize

i...v....x....v....x
abracadabradadabraca

a {1,4,6,8,11,13,15,18,20}

{2,5,7,9,12,14,16,19,21}

SPEXS: extend …

a {2,5,7,9,12,14,16,19,21}

b {2,9,16}   c {5,19}   d {7,12,14}

SPEXS: find frequent first

a {2,5,7,9,12,14,16,19,21}

b {2,9,16}   d {7,12,14}

SPEXS: group positions

a {2,5,7,9,12,14,16,19,21}

b {2,9,16}   d {7,12,14}   [bd] {2,7,9,12,14,16}
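Grouping positions amounts to merging the sorted position lists of the grouped characters. A minimal sketch, with `group_positions` as a hypothetical helper name and the lists copied from the slide:

```python
from heapq import merge

def group_positions(extensions, group):
    """Merge the sorted position lists of all characters in a group such as 'bd'."""
    return list(merge(*(extensions[c] for c in group)))

# Position lists of the extensions of "a" in abracadabradadabraca.
extensions = {"b": [2, 9, 16], "c": [5, 19], "d": [7, 12, 14]}
print(group_positions(extensions, "bd"))   # [2, 7, 9, 12, 14, 16]
```

Since the input lists are already sorted, the merge is linear in the number of positions and the result can be extended further exactly like a single-character pattern.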

The wildcards

GCAT.{3,6}X

The wildcards

GCAT.*X

The wildcards: not too many

w:0

aw:0

.{3.6}b1w:1w:0

SPEXS: general algorithm

1. S = input sequences ( ||S||=n )
2. e = empty pattern, e.pos = {1,...,n}
3. enqueue( order , e )
4. while p = dequeue( order )
5.     generate all allowed extensions p’ of p (& p’.pos)
6.     enqueue( order, p’, priority(p’) )
7.     enqueue( output, p’, fitness(p’) )
8. while p = dequeue( output )
9.     Output p
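The two priority queues of the general algorithm can be sketched with Python's `heapq`. The choices below are assumptions made for illustration only: substrings as the pattern language, occurrence count as `priority`, and count times length as `fitness`. SPEXS itself lets the user define these.

```python
import heapq
from collections import defaultdict

def spexs(seq, min_count=2, top=3):
    """Return the `top` patterns by an assumed fitness (count * length)."""
    # order: (priority, pattern, positions); most frequent patterns are expanded first.
    order = [(-len(seq), "", list(range(len(seq))))]
    output = []                                    # (negated fitness, pattern)
    while order:
        _, p, pos = heapq.heappop(order)
        extensions = defaultdict(list)
        for i in pos:                              # generate allowed extensions p'
            if i < len(seq):
                extensions[seq[i]].append(i + 1)
        for c, q in extensions.items():
            if len(q) >= min_count:
                heapq.heappush(order, (-len(q), p + c, q))
                heapq.heappush(output, (-len(q) * len(p + c), p + c))
    return [heapq.heappop(output)[1] for _ in range(min(top, len(output)))]

print(spexs("abracadabradadabraca"))   # ['abra', 'abraca', 'adabra']
```

The three reported patterns tie at fitness 12 (3 occurrences of length 4, or 2 occurrences of length 6); ties are broken alphabetically by the heap's tuple ordering.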

Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (pp. 20) May 1998. Department of Computer Science, University of Helsinki.

Applications in bioinformatics:

-Gene regulation (1998: 255+ citations, 2000: 73 cit)

Jaak Vilo: Pattern Discovery from Biosequences PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3 Helsinki, November 2002, 149 pages

-Functional elements in proteins (2002: 32 cit)

SPEXS ‐ Sequence Pattern EXhaustive Search

Jaak Vilo, 1998, 2002

• User‐definable pattern language: substrings, character groups, wildcards, flexible wildcards (c.f. PROSITE)

• Fast exhaustive search over pattern language

• “Lazy suffix tree construction”‐like algorithm (Kurtz, Giegerich)

• Analyze multiple sets of sequences simultaneously

• Restrict search to most frequent patterns only (in each set)

• Report most frequent patterns, patterns over‐ or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution

30min

Multiple data sets

D1 D2 D3

4/3 (6) 3/3 (12) 2/2 (9)

.G.GATGAG.T. 39 seq (vs 193) p= 2.5e-33

-1: .G.GATGAG.T. 61 seq (vs 1292) p= 1.4e-19

-2: .G.GATGAG.T. 91 seq (vs 5464)


Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Series of Publications A, Report A-2002-3, Helsinki, November 2002, 149 pages


These hits result in a PWM:

PWM based on all previous hits, here shown highest-scoring occurrences in blue

All against all approximate matching

For every subsequence of every sequence

Match approximately against all the sequences.

Approximate hits define PWM matrices (not all positions vary equally).

Look for ALL PWM-s derived from data that are enriched in data set (vs. background).

Hendrik Nigul, Jaak Vilo

Dynamic programming

• A small number of edit operations allows the search to be limited efficiently to a band around the main diagonal
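The diagonal restriction can be illustrated with a banded edit-distance computation: with at most k edit operations allowed, only cells with |i - j| ≤ k of the dynamic programming table need to be filled. This is a generic sketch of that idea, assuming unit costs for substitution, insertion and deletion; it is not the trie-based matcher used in the lecture's tools.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is at most k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                       # the end cell lies outside the band
    INF = k + 1                           # any value above k means "too far"
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= k:
            cur[0] = i                    # delete the first i characters of a
        # Only columns within k of the main diagonal are computed.
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                         prev[j] + 1,                            # delete from a
                         cur[j - 1] + 1,                         # insert into a
                         INF)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None

print(banded_edit_distance("GATGAGAT", "GATGAGCT", 1))   # 1
print(banded_edit_distance("kitten", "sitting", 2))      # None
```

Each row touches at most 2k + 1 cells, so the work is proportional to k times the sequence length instead of the full table.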

Suffix Tree

AC G T

G

G

T

{1:24,2:12,2:23…}

Trie based all against all approximate matching

• trieindex

• trieagrep

• trieallagrep

• triematrix

Hendrik Nigul, Jaak Vilo
