Data Mining MTAT.03.183 (4AP = 6EAP)
Streams, time series
Jaak Vilo
2009 Fall
Summary so far
• Data preparation
• Machine learning
• Statistics/significance
• Large data – algorithmics
• Visualisation
• Queries/reporting, OLAP
• Different types of data
• Business value
Jaak Vilo and other authors UT: Data Mining 2009
Streams, time series
• Time
• Sequence order and position
• Continuously arriving data
Wikipedia
• Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.
• In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.
Software
• RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time‐varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin (formerly: concept drift plugin))
• MOA (Massive Online Analysis): free open‐source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi‐directional interaction with Weka (machine learning).
Literature on Stream Mining
• http://www.csse.monash.edu.au/~mgaber/WResources.htm
Mining Data Streams
• What is stream data? Why Stream Data Systems?
• Stream data management systems: Issues and solutions
• Stream data cube and multidimensional OLAP analysis
• Stream frequent pattern analysis
• Stream classification
• Stream cluster analysis
• Research issues
November 25, 2009 Data Mining: Concepts and Techniques
Characteristics of Data Streams
• Data streams
  – Data streams: continuous, ordered, changing, fast, huge amount
  – Traditional DBMS: data stored in finite, persistent data sets
• Characteristics
  – Huge volumes of continuous data, possibly infinite
  – Fast changing and requires fast, real-time response
  – Data stream captures nicely our data processing needs of today
  – Random access is expensive: single scan algorithm (can only have one look)
  – Store only the summary of the data seen thus far
  – Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing
Stream Data Applications
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even saved but random access is too expensive)
DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
“Unbounded” disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Ack. From Motwani’s PODS tutorial slides
Mining Data Streams
• What is stream data? Why Stream Data Systems?
• Stream data management systems: Issues and solutions
• Stream data cube and multidimensional OLAP analysis
• Stream frequent pattern analysis
• Stream classification
• Stream cluster analysis
• Research issues
Architecture: Stream Query Processing
[Figure: SDMS (Stream Data Management System). Multiple streams enter the Stream Query Processor; the user/application issues a continuous query and receives results; the processor uses scratch space (main memory and/or disk).]
Stream Data Mining Tasks
• On-line analysis of streams
• Clustering data streams
• Classification of data streams
• Mining frequent patterns in data streams
• Mining sequential patterns in data streams
• Mining partial periodicity in data streams
• Mining outliers and unusual patterns in data streams
• ……
Clustering on Streams
• K-means
  – not suitable for stream mining
• CluStream
  – assumes the shape of the cluster is always a circle
• DenStream
  – detects arbitrary-shape clusters in stream data
Frequent Pattern Mining (FPM) in data streams
• Frequent (/hot/top) patterns: items, item sets or sequences occurring frequently in a database.
• Issues
  – Limited memory
  – Reading past data is impossible
• Question: how far is it justified to mine only frequent patterns in a data stream?
Infrequent pattern mining
Objective:
1. To find abnormal, surprising or “interesting” patterns in the data stream.
2. Mutual pattern mining.
3. Stream-specific item set mining.
4. Association rule mining among events of interest.
Application:
1. Text mining.
2. Distributed sensor networks.
3. Works well for evolving data streams.
Challenges in Stream Data Analysis
• Data volume is huge
• Need to remember recent and historical data
• Approaches to data reduction
• Need single linear scan algorithms
• Most existing algorithms and prototype systems are memory and CPU bound, and can only perform a single data mining function
• Desire to perform multiple analyses at the same time
• Occurrence of concept drifts where the previous model is no longer valid
• Reduce the cost of learning where models need to be updated and replaced
• Require instant response
Loretta Auvil
Stream Data Reduction
• Challenges of “OLAP-ing” stream data
  – Raw data cannot be stored
  – Simple aggregates are not powerful enough
  – History shape and patterns at different levels are desirable
• MAIDS unique approach
  – A tilted time window to aggregate data at different points in time
  – A scalable multi-dimensional stream data cube that can aggregate a model of stream data efficiently without accessing the raw data
MAIDS Approach: Tilted Time Window
• Recent data is registered and weighted at a finer granularity than longer term data
• As the edge of a time window is reached, the finer granularity data is summarized and propagated to a coarser granularity
• Window is maintained automatically
[Figure: tilted time window on a time axis from past to present, with segments labelled 24hrs, 4qtrs, 15minutes, 7days, 30sec]
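The roll-up idea behind a tilted time window can be sketched in a few lines. This is a minimal illustration, not the MAIDS implementation: the class name `TiltedTimeWindow`, the level sizes (4 quarters per hour, 24 hours per day, 7 days kept) and the use of a plain sum as the aggregate are all assumptions made for the example.

```python
from collections import deque


class TiltedTimeWindow:
    """Hypothetical sketch: fine slots for recent data, coarse slots for old data."""

    def __init__(self, capacities=(4, 24, 7)):
        # capacities[i] slots at level i are summarized into one slot at
        # level i + 1 (assumed here: 4 quarters -> 1 hour, 24 hours -> 1 day).
        self.capacities = capacities
        self.levels = [deque() for _ in capacities]

    def add(self, value, level=0):
        slots = self.levels[level]
        slots.append(value)
        if len(slots) < self.capacities[level]:
            return
        if level + 1 < len(self.levels):
            # Edge of this level reached: summarize and propagate upward.
            total = sum(slots)
            slots.clear()
            self.add(total, level + 1)
        elif len(slots) > self.capacities[level]:
            # Coarsest level: forget the oldest slot.
            slots.popleft()

    def total(self):
        return sum(sum(slots) for slots in self.levels)


window = TiltedTimeWindow()
for _ in range(4 * 24 + 5):   # one full day plus five more quarters
    window.add(1)
print([list(slots) for slots in window.levels])
```

Only 4 + 24 + 7 numbers are ever stored, however long the stream runs; that is the data-reduction point of the slide.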
MAIDS: Stream Mining Architecture
MAIDS is aimed to:
• Discover changes, trends and evolution characteristics in data streams
• Construct clusters and classification models from data streams
• Explore frequent patterns and similarities among data streams
Features of MAIDS
• General purpose tool for data stream analysis
• Processes high-rate and multi-dimensional data
• Adopts a flexible tilted time window framework
• Facilitates multi-dimensional analysis using a stream cube architecture
• Integrates multiple data mining functions
• Provides user-friendly interface: automatic analysis and on-demand analysis
• Facilitates setting alarms for monitoring
• Built in D2K as D2K modules and leveraged in the D2K Streamline tool
Statistics Query Engine
• Answers user queries on data statistics, such as count, max, min, average, regression, etc.
• Uses tilted time window
• Uses an efficient data structure, H-tree, for partial computation of data cubes
Stream Data Classifier
• Builds models to make predictions
• Uses Naïve Bayesian classifier with boosting
• Uses tilted time window to track time-related info
• Sets alarm to monitor events
Stream Pattern Finder
• Finds frequent patterns with multiple time granularities
• Keeps precise/compressed history in tilted time window
• Mines only the item sets of interest using the FP-tree algorithm
• Mines evolution and dramatic changes of frequent patterns
Stream Data Clustering
• Two stages: micro-clustering and macro-clustering
• Uses micro-clustering to do incremental, online processing and maintenance
• Uses tilted time frame
• Detects outliers when new clusters are formed
Demonstration
Significant Advances in the Areas of Data Management and Mining
• Tilted-time window for multi-resolution modeling
• Multi-dimensional analysis using a stream cube architecture
• Efficient “one-look” stream data mining algorithms:
  – classification, frequent pattern analysis, clustering, and information visualization
• Integration of “one-look” approaches into one stream data mining platform so they can cooperate to discover patterns and surprising events in real-time
• Internationally recognized research leadership in the areas of data management, mining, and knowledge sharing
• Experience in development of robust software framework supporting advanced data mining and information visualization
• Experience in development of software environments supporting problem solving and evidence-based decision making
Knowledge Extraction from Streaming Text
Information extraction
• process of using advanced automated machine learning approaches
• to identify entities in text documents
• extract this information along with the relationships these entities may have in the text documents
This project demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.
Challenges of Stream Data Processing
• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries are often complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing
  – Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
  – Most stream data are at low-level or multi-dimensional in nature
Processing Stream Queries
• Query types
  – One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
  – Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  – For real-time response, main memory algorithm should be used
  – Memory requirement is unbounded if one will join future tuples
• Approximate query answering
  – With bounded memory, it is not always possible to produce exact answers
  – High-quality approximate answers are desired
  – Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
Methodologies for Stream Data Processing
• Major challenges
  – Keep track of a large universe, e.g., pairs of IP addresses, not ages
• Methodology
  – Synopses (trade-off between accuracy and storage)
  – Use synopsis data structure, much smaller (O(log^k N) space) than their base data set (O(N) space)
  – Compute an approximate answer within a small error range (factor ε of the actual answer)
• Major methods
  – Random sampling
  – Histograms
  – Sliding windows
  – Multi-resolution model
  – Sketches
  – Randomized algorithms
Stream Data Processing Methods (1)
• Random sampling (but without knowing the total length in advance)
  – Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.
• Sliding windows
  – Make decisions based only on recent data of sliding window size w
  – An element arriving at time t expires at time t + w
• Histograms
  – Approximate the frequency distribution of element values in a stream
  – Partition data into a set of contiguous buckets
  – Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
• Multi-resolution models
  – Popular models: balanced binary trees, micro-clusters, and wavelets
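Reservoir sampling as described above can be sketched as follows. This is one standard way to realise it (often called Algorithm R), written as an illustration; the function name `reservoir_sample` and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of s elements from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir.
            reservoir.append(item)
        else:
            # Element number n enters with probability s/n and then
            # replaces a uniformly chosen old element.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 5, random.Random(0))
print(sample)
```

The stream is read once and only s elements are held in memory, which matches the single-scan requirement of stream processing.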
Stream Data Processing Methods (2)
• Sketches
  – Histograms and wavelets require multi-passes over the data but sketches can operate in a single pass
  – Frequency moments of a stream A = {a_1, …, a_N}, F_k:
      $F_k = \sum_{i=1}^{v} m_i^k$
    where v: the universe or domain size, m_i: the frequency of i in the sequence
  – Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space
• Randomized algorithms
  – Monte Carlo algorithm: bound on running time but may not return correct result
  – Chebyshev’s inequality: let X be a random variable with mean μ and standard deviation σ
      $P(|X - \mu| > k) \le \frac{\sigma^2}{k^2}$
  – Chernoff bound: let X be the sum of independent Poisson trials X_1, …, X_n, δ in (0, 1]
      $P[X > (1 + \delta)\mu] < e^{-\mu\delta^2/4}$
  – The probability decreases exponentially as we move from the mean
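As a small worked example of the frequency moments F_k, the following computes them exactly with a full counter. Note that this is not a sketch: it uses O(v) memory and only shows what the quantities mean (F_0 is the number of distinct values, F_1 the stream length, F_2 the sum of squared frequencies). The function name `frequency_moment` is made up for the illustration.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum over distinct values of (frequency ** k)."""
    counts = Counter(stream)
    if k == 0:
        # Convention: F_0 counts the distinct values that actually occur.
        return len(counts)
    return sum(m ** k for m in counts.values())


stream = "abracadabra"   # frequencies: a=5, b=2, r=2, c=1, d=1
print(frequency_moment(stream, 0))  # 5 distinct values
print(frequency_moment(stream, 1))  # 11 elements in total
print(frequency_moment(stream, 2))  # 25 + 4 + 4 + 1 + 1 = 35
```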
Approximate Query Answering in Streams
• Sliding windows
  – Only over sliding windows of recent stream data
  – Approximation but often more desirable in applications
• Batched processing, sampling and synopses
  – Batched if update is fast but computing is slow
    · Compute periodically, not very timely
  – Sampling if update is slow but computing is fast
    · Compute using sample data, but not good for joins, etc.
  – Synopsis data structures
    · Maintain a small synopsis or sketch of data
    · Good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  – Blocking if unable to produce the first output until seeing the entire input
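An average over the whole stream is a blocking operator, but restricted to a sliding window it can be answered continuously. A minimal sketch, assuming elements arrive in time order and an element arriving at time t expires at time t + w; the class name `SlidingWindowAverage` is hypothetical.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuously maintained average over the last w time units."""

    def __init__(self, w):
        self.w = w
        self.items = deque()   # (arrival_time, value), oldest first
        self.total = 0.0

    def add(self, t, value):
        self.items.append((t, value))
        self.total += value
        self._expire(t)

    def _expire(self, now):
        # An element that arrived at time t is dropped once now >= t + w.
        while self.items and self.items[0][0] + self.w <= now:
            _, old = self.items.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.items) if self.items else None


avg = SlidingWindowAverage(w=3)
for t, value in [(0, 10), (1, 20), (2, 30), (3, 40)]:
    avg.add(t, value)
print(avg.average())   # the value from time 0 has expired: (20 + 30 + 40) / 3
```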
Projects on DSMS (Data Stream Management System)
• Research projects and system prototypes
  – STREAM (Stanford): A general-purpose DSMS
  – Cougar (Cornell): sensors
  – Aurora (Brown/MIT): sensor monitoring, dataflow
  – Hancock (AT&T): telecom streams
  – Niagara (OGI/Wisconsin): Internet XML databases
  – OpenCQ (Georgia Tech): triggers, incr. view maintenance
  – Tapestry (Xerox): pub/sub content-based filtering
  – Telegraph (Berkeley): adaptive engine for sensors
  – Tradebot (www.tradebot.com): stock tickers & streams
  – Tribeca (Bellcore): network monitoring
  – MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams
Stream Data Mining vs. Stream Querying
• Stream mining: a more challenging task in many cases
  – It shares most of the difficulties with stream querying
    · But often requires less “precision”, e.g., no join, grouping, sorting
  – Patterns are hidden and more general than querying
  – It may require exploratory analysis
    · Not necessarily continuous queries
• Stream data mining tasks
  – Multi-dimensional on-line analysis of streams
  – Mining outliers and unusual patterns in stream data
  – Clustering data streams
  – Classification of stream data
Concept drift
• In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.
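One naive way to notice concept drift is to watch a model's prediction errors and raise an alarm when the recent error rate rises well above the long-run rate. The sketch below is only that heuristic, not EDDM or any method named in these slides; the function name `drift_alarms`, the window size and the threshold are assumptions.

```python
from collections import deque


def drift_alarms(errors, window=20, threshold=0.3):
    """Return indices where the recent error rate exceeds the overall rate by `threshold`.

    errors: iterable of 0/1 flags, 1 meaning the model mispredicted that instance.
    """
    recent = deque(maxlen=window)   # last `window` error flags
    seen = wrong = 0
    alarms = []
    for i, e in enumerate(errors):
        recent.append(e)
        seen += 1
        wrong += e
        if len(recent) == window:
            recent_rate = sum(recent) / window
            overall_rate = wrong / seen
            if recent_rate - overall_rate > threshold:
                alarms.append(i)
    return alarms


# A model that is always right for 100 instances and then always wrong:
errors = [0] * 100 + [1] * 50
print(drift_alarms(errors)[:1])   # first alarm shortly after position 100
```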
Episode Rules
• Association rules applied to sequences of events.
• Episode – set of event predicates and a partial ordering on them
© Prentice Hall
Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001
Basics
• Association rules describe how things occur together in the data
  – E.g., "IF an alarm has certain properties, THEN it will have other given properties"
• Episode rules describe temporal relationships between things
  – E.g., "IF a certain combination of alarms occurs within a time period, THEN another combination of alarms will occur within a time period"
Course on Data Mining
Basics
[Figure: a Network Management System receives alarms from the switched network and the access network, which consist of MSC, BSC and BTS elements. Legend: MSC = mobile station controller, BSC = base station controller, BTS = base station transceiver.]
Basics
• As defined earlier, telecom data contains alarms:
  1234 EL1 PCM 940926082623 A1 ALARMTEXT..
  (fields, in order: alarm number, alarming network element, alarm type, date and time, alarm severity class)
• Now we forget about relationships between attributes within alarms as with the association rules
• We just take the alarm number attribute, handle it here as event/alarm type and inspect the relationships between events/alarms
Episodes
• Partially ordered set of pages
• Serial episode – totally ordered with time constraint
• Parallel episode – partially ordered with time constraint
• General episode – partially ordered with no time constraint
DAG for Episode
Basics
• Data:
  – Data is built on a set R of event types
  – Every event is a pair (A, t), where
    • A ∈ R is the event type (e.g., alarm type)
    • t is an integer, the occurrence time of the event
  – Event sequence S on R is a triple (s, Ts, Te)
    • Ts is the starting time and Te is the ending time
    • Ts < Te are integers
    • s = ⟨(A1, t1), (A2, t2), …, (An, tn)⟩
    • Ai ∈ R and Ts ≤ ti < Te for all i = 1, …, n
Basics
• Example alarm data sequence:
  Time axis: 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150
  Events in order: D C A B D A B C A D C A B D A
• Here:
  – A, B, C and D are event (or here alarm) types
  – 10…150 are occurrence times
  – s = ⟨(D, 10), (C, 20), …, (A, 150)⟩
  – Ts (starting time) = 10 and Te (ending time) = 150
• Note: there need not be an event at every time slot!
Basics
• Episodes:
  – An episode is a pair (V, ≤)
    • V is a collection of event types, e.g., alarm types
    • ≤ is a partial order on V
  – Given a sequence S of alarms, an episode α = (V, ≤) occurs within S if there is a way of satisfying the event types (e.g., alarm types) in V using the alarms of S so that the partial order ≤ is respected
  – Intuitively: episodes consist of alarms that have certain properties and occur in a certain partial order
Basics
• The most useful partial orders are:
  – Total orders
    • The predicates of each episode have a fixed order
    • Such episodes are called serial (or "ordered")
  – Trivial partial orders
    • The order of predicates is not considered
    • Such episodes are called parallel (or "unordered")
• Complicated?
  – Not really, let's take some clarifying examples
Basics
• Examples:
[Figure: three example episodes over event types A, B and C: a serial episode, a parallel episode, and a more complex episode with serial and parallel parts]
WINEPI Approach
• The name of the WINEPI method comes from the technique it uses: a sliding window
• Intuitively:
  – A window is slid through the event-based data sequence
  – Each window "snapshot" is like a row in a database
  – The collection of these "snapshots" forms the rows in the database
• Complicated?
  – Not really, let's take a clarifying example
WINEPI Approach
• Example alarm data sequence:
  Time axis: 0 10 20 30 40 50 60 70 80 90
  Events in order: D C A B D A B C
• The window width is 40 seconds, last point excluded
• The first/last window contains only the first/last event
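The sliding of the window can be sketched as below. The example assumes (this is not stated on the slide) that the eight events occur at times 10, 20, …, 80, that Ts = 10 and Te = 90, and that the window is shifted in steps of 10 seconds; the function name `windows` and the `step` parameter are made up for the illustration.

```python
def windows(events, t_start, t_end, width, step=1):
    """List all windows (ts, te, events) with ts <= t < te, te = ts + width.

    The first window ends just after t_start and the last one starts just
    before t_end, so both extend outside the sequence.
    """
    result = []
    ts = t_start - width + step
    while ts < t_end:
        te = ts + width
        inside = [(a, t) for a, t in events if ts <= t < te]
        result.append((ts, te, inside))
        ts += step
    return result


# Assumed timing: one event every 10 seconds, starting at time 10.
events = list(zip("DCABDABC", range(10, 90, 10)))
for ts, te, inside in windows(events, 10, 90, 40, step=10):
    print(f"[{ts}, {te}):", "".join(a for a, _ in inside))
```

Under these assumptions there are 11 windows; the first contains only D and the last only C, as stated on the slide.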
WINEPI Approach
• Formally, given a set E of event types, an event sequence S = (s, Ts, Te) is an ordered sequence of events event_i such that event_i ≤ event_{i+1} for all i = 1, …, n−1, and Ts ≤ event_i < Te for all i = 1, …, n
[Figure: timeline from Ts to Te with events event_1, event_2, event_3, …, event_n at times t1, t2, t3, …, tn]
WINEPI Approach
• Formally, a window on event sequence S is an event sequence Sw = (w, ts, te), where ts < Te, te > Ts, and w consists of those pairs (event, t) from s where ts ≤ t < te
• The value te − ts is called the window width, W
[Figure: timeline from Ts to Te with events event_1, …, event_n at times t1, …, tn and a window of width W from ts to te]
WINEPI Approach
• By definition, the first and the last windows on a sequence extend outside the sequence, so that the first window contains only the first time point of the sequence, and the last window only the last time point
[Figure: timeline from Ts to Te with events event_1, …, event_n; the first window of width W starts before Ts and the last window ends after Te]
WINEPI Approach
• The frequency (cf. support with association rules) of an episode α is the fraction of windows in which the episode occurs, i.e.,
    $fr(\alpha, S, W) = \frac{|\{ S_w \in W(S, W) \mid \alpha \text{ occurs in } S_w \}|}{|W(S, W)|}$
  where W(S, W) is the set of all windows S_w of sequence S such that the window width is W
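The frequency definition can be turned into a direct (non-incremental) computation. The sketch below handles only serial and parallel episodes, given as tuples of event types, and assumes events are sorted by time with distinct timestamps; all function names, the `step` parameter and the event times 10, 20, …, 80 for the example sequence are assumptions made for the illustration.

```python
from collections import Counter


def occurs_parallel(episode, window_events):
    """Parallel episode: every event type is present, in any order."""
    need = Counter(episode)
    have = Counter(a for a, _ in window_events)
    return all(have[a] >= n for a, n in need.items())


def occurs_serial(episode, window_events):
    """Serial episode: the event types appear in the given order."""
    types = iter(a for a, _ in window_events)
    # Each any() consumes the iterator up to the next match (subsequence test).
    return all(any(a == x for x in types) for a in episode)


def frequency(episode, events, t_start, t_end, width, step=1, occurs=occurs_parallel):
    """Fraction of windows of the given width in which the episode occurs."""
    hits = total = 0
    ts = t_start - width + step
    while ts < t_end:
        inside = [(a, t) for a, t in events if ts <= t < ts + width]
        hits += occurs(episode, inside)
        total += 1
        ts += step
    return hits / total


events = list(zip("DCABDABC", range(10, 90, 10)))   # assumed event times
print(frequency(("A", "B"), events, 10, 90, 40, step=10))                        # parallel {A, B}
print(frequency(("B", "A"), events, 10, 90, 40, step=10, occurs=occurs_serial))  # serial B then A
```

With these assumed times, A and B occur together in 6 of the 11 windows, while B followed by A occurs in only 2 of them.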
WINEPI Approach
• When searching for the episodes, a frequency threshold (cf. support threshold with association rules) min_fr is used
• Episode α is frequent if fr(α, s, win) ≥ min_fr, i.e., "if the frequency of α is at least the minimum frequency threshold within the data sequence s and with window width win"
• F(s, win, min_fr): a collection of frequent episodes in s with respect to win and min_fr
• Apriori trick holds: if an episode α is frequent in an event sequence s, then all subepisodes β ≼ α are frequent
WINEPI Approach
• Formally, an episode rule is an expression β ⇒ γ, where β and γ are episodes such that β is a subepisode of γ
• An episode β is a subepisode of γ (β ≼ γ), if the graph representation of β is a subgraph of the representation of γ
[Figure: graph representations of two example episodes over event types A, B and C, one a subepisode of the other]
WINEPI Approach
• The fraction
    $\frac{fr(\gamma, S, W)}{fr(\beta, S, W)}$ = (frequency of the whole episode) / (frequency of the LHS episode)
  is the confidence of the WINEPI episode rule β ⇒ γ
• The confidence can be interpreted as the conditional probability of the whole of γ occurring in a window, given that β occurs in it
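For parallel episodes the confidence reduces to a ratio of window counts, since both frequencies share the same denominator. A minimal sketch under the same assumptions as the running example (events at times 10, 20, …, 80, window shifted in steps of 10); the function names are hypothetical and only parallel episodes given as sets of distinct event types are handled.

```python
def window_type_sets(events, t_start, t_end, width, step=1):
    """Set of event types seen in each window [ts, ts + width)."""
    sets = []
    ts = t_start - width + step
    while ts < t_end:
        sets.append({a for a, t in events if ts <= t < ts + width})
        ts += step
    return sets


def confidence(lhs, whole, events, t_start, t_end, width, step=1):
    """Confidence of the rule lhs => whole for parallel episodes (lhs is a subset of whole)."""
    wins = window_type_sets(events, t_start, t_end, width, step)
    n_lhs = sum(set(lhs) <= w for w in wins)       # windows containing the LHS episode
    n_whole = sum(set(whole) <= w for w in wins)   # windows containing the whole episode
    return n_whole / n_lhs if n_lhs else None


events = list(zip("DCABDABC", range(10, 90, 10)))   # assumed event times
print(confidence({"A"}, {"A", "B"}, events, 10, 90, 40, step=10))
```

Here A occurs in 7 windows and A together with B in 6 of them, so the rule "A ⇒ A and B" has confidence 6/7.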
Mika Klemettinen and Pirjo MoenMika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001University of Helsinki/Dept of CS Autumn 2001
WINEPI ApproachWINEPI Approach
•• Intuitively: Intuitively: WINEPI l lik i i l b i h– WINEPI rules are like association rules, but with an additional time aspect: If events (alarms) satisfying the rule antecedent (leftIf events (alarms) satisfying the rule antecedent (left-hand side) occur in the right order within W time units, then also the rule consequent (right-hand side) occurs inthen also the rule consequent (right hand side) occurs in the location described by , also within W time units
antecedent antecedent consequent [window width] (f, c)consequent [window width] (f, c)
Course on Data MiningCourse on Data Mining 60Page60/54
Mika Klemettinen and Pirjo MoenMika Klemettinen and Pirjo Moen University of Helsinki/Dept of CS Autumn 2001University of Helsinki/Dept of CS Autumn 2001
WINEPI Algorithm
• Input: A set R of event/alarm types, an event sequence s over R, a set E of episodes, a window width win, and a frequency threshold min_fr
• Output: The collection F(s, win, min_fr)
• Method:
1. compute C1 := {α ∈ E | |α| = 1};
2. i := 1;
3. while Ci ≠ ∅ do
4. (*) compute Fi(s, win, min_fr) := {α ∈ Ci | fr(α, s, win) ≥ min_fr};
5. i := i + 1;
6. (**) compute Ci := {α ∈ E | |α| = i, and β ∈ F|β|(s, win, min_fr) for all β ∈ E, β ≺ α};
(*) = database pass, (**) = candidate generation
WINEPI Algorithm
• First problem: given a sequence and an episode, find out whether the episode occurs in the sequence
• Finding the number of windows containing an occurrence of the episode can be reduced to this
• Successive windows have a lot in common
• How to use this?
– An incremental algorithm
– Same idea as for association rules
– A candidate episode has to be a combination of two episodes of smaller size
– Parallel episodes, serial episodes
WINEPI Algorithm
• Parallel episodes:
– For each candidate α maintain a counter α.event_count: how many events of α are present in the window
– When α.event_count becomes equal to |α|, indicating that α is entirely included in the window, save the starting time of the window in α.inwindow
– When α.event_count decreases again, increase the field α.freq_count by the number of windows where α remained entirely in the window
• Serial episodes: use state automata
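The counter idea for parallel episodes can be sketched as follows. This is a simplified illustration, not the bookkeeping of the slides: it keeps an event_count per candidate and updates it only when an event type enters or leaves the window, but it tallies freq_count once per window position instead of using the inwindow start time. The function name, the half-open window convention [t, t + win), and the choice of first and last window are assumptions.

```python
from collections import Counter, defaultdict


def count_parallel_incremental(events, candidates, win, step):
    """Count, per candidate, the windows that contain all of its event types.

    events: list of (time, type) sorted by time.
    candidates: iterable of frozensets of event types (parallel episodes).
    Returns (freq_count, number_of_windows).
    """
    by_type = defaultdict(list)      # event type -> candidates that contain it
    for cand in candidates:
        for etype in cand:
            by_type[etype].append(cand)

    in_window = Counter()            # event type -> occurrences inside the window
    event_count = Counter()          # candidate -> how many of its types are present
    freq_count = Counter()           # candidate -> windows that contain it entirely

    first, last = events[0][0], events[-1][0]
    enter = leave = 0                # index of the next event to enter / to leave
    n_windows = 0
    # Assumed convention: the first window holds only the first event,
    # the last window only the last event.
    for start in range(first - win + step, last + 1, step):
        end = start + win
        # Events entering on the right edge of [start, end).
        while enter < len(events) and events[enter][0] < end:
            etype = events[enter][1]
            in_window[etype] += 1
            if in_window[etype] == 1:
                for cand in by_type[etype]:
                    event_count[cand] += 1
            enter += 1
        # Events dropping out on the left edge.
        while leave < enter and events[leave][0] < start:
            etype = events[leave][1]
            in_window[etype] -= 1
            if in_window[etype] == 0:
                for cand in by_type[etype]:
                    event_count[cand] -= 1
            leave += 1
        for cand, present in event_count.items():
            if present == len(cand):
                freq_count[cand] += 1
        n_windows += 1
    return freq_count, n_windows
```

Successive windows share most of their events, so only the events crossing a window edge are touched at each step.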
WINEPI Approach
• Example alarm data sequence:
time (s):  0  10  20  30  40  50  60  70  80  90
event:         D   C   A   B   D   A   B   C
• The window width is 40 secs, movement step 10 secs
• The length of the sequence is 70 secs (10-80)
WINEPI Approach
• By sliding the window, we'll get 11 windows (U1-U11):
[Figure: windows U1, U2, …, U11 sliding over the sequence]
time (s):  0  10  20  30  40  50  60  70  80  90
event:         D   C   A   B   D   A   B   C
• Frequency threshold is set to 40%, i.e., an episode has to occur at least in 5 of the 11 windows
WINEPI Approach
• Suppose that the task is to find all parallel episodes:
– First, create singletons, i.e., parallel episodes of size 1 (A, B, C, D)
– Then, recognize the frequent singletons (here all are)
– From those frequent episodes, build candidate episodes of size 2: AB, AC, AD, BC, BD, CD
– Then, recognize the frequent parallel episodes (here all are)
– From those frequent episodes, build candidate episodes of size 3: ABC, ABD, ACD, BCD
– When recognizing the frequent episodes, only ABD occurs in more than four windows
– There are no candidate episodes of size four
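The level-wise walk-through above can be reproduced with a short sketch. It assumes the example alarm sequence has its events at 10, 20, ..., 80 seconds, that windows are half-open intervals [t, t + 40), and that the first and last windows contain only the first and last event respectively (this gives the 11 windows of the example). All function names are illustrative.

```python
from itertools import combinations

# Example alarm sequence: (time in seconds, event type).
EVENTS = [(10, "D"), (20, "C"), (30, "A"), (40, "B"),
          (50, "D"), (60, "A"), (70, "B"), (80, "C")]
WIN, STEP = 40, 10


def windows(events, win, step):
    """Yield the set of event types seen in each half-open window [t, t + win)."""
    first, last = events[0][0], events[-1][0]
    for start in range(first - win + step, last + 1, step):
        yield {e for t, e in events if start <= t < start + win}


def frequency(episode, wins):
    """Fraction of windows containing every event type of a parallel episode."""
    return sum(episode <= w for w in wins) / len(wins)


def frequent_parallel_episodes(events, win, step, min_fr):
    wins = list(windows(events, win, step))
    types = sorted({e for _, e in events})
    candidates = [frozenset([t]) for t in types]
    result = {}
    size = 1
    while candidates:
        # Database pass: keep the candidates that reach the threshold.
        level = {c: frequency(c, wins) for c in candidates}
        level = {c: f for c, f in level.items() if f >= min_fr}
        result.update(level)
        # Candidate generation: all sub-episodes one size smaller must be frequent.
        size += 1
        candidates = [
            frozenset(c) for c in combinations(types, size)
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return result


FREQUENT = frequent_parallel_episodes(EVENTS, WIN, STEP, min_fr=0.4)
for episode, f in sorted(FREQUENT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(episode)), f"{f:.0%}")
```

With a 40% threshold the only frequent episode of size 3 is ABD, so no candidate of size 4 is generated.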
WINEPI Approach
• Episode frequencies and example rules with WINEPI:
D : 73%
C : 73%
A : 64%
B : 64%
D C : 45%
D A : 55%
D B : 45%
C A : 45%
C B : 45%
A B : 45%
D A B : 45%
Example rules:
D ⇒ A [40] (55%, 75%)
D A ⇒ B [40] (45%, 82%)
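The two example rules can be checked with a few lines of code. This is a sketch under stated assumptions: events sit at 10, 20, ..., 80 seconds, windows are half-open [t, t + 40) and move in steps of 10, episodes are treated as parallel (order ignored), and `fr` and `confidence` are illustrative names.

```python
from fractions import Fraction

EVENTS = [(10, "D"), (20, "C"), (30, "A"), (40, "B"),
          (50, "D"), (60, "A"), (70, "B"), (80, "C")]
WIN, STEP = 40, 10


def fr(episode, events=EVENTS, win=WIN, step=STEP):
    """Fraction of windows that contain all event types of a parallel episode."""
    first, last = events[0][0], events[-1][0]
    starts = range(first - win + step, last + 1, step)
    hits = 0
    for start in starts:
        present = {e for t, e in events if start <= t < start + win}
        hits += set(episode) <= present
    return Fraction(hits, len(starts))


def confidence(lhs, rhs):
    """Confidence of lhs => rhs: fr(whole episode) / fr(left-hand side)."""
    return fr(set(lhs) | set(rhs)) / fr(lhs)


print(fr("DA"), confidence("D", "A"))      # frequency and confidence of D => A
print(fr("DAB"), confidence("DA", "B"))    # frequency and confidence of D A => B
```

The exact ratio for D A ⇒ B is 5/6; dividing the rounded frequencies 45% by 55% gives the 82% shown on the slide.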
WINEPI: Experimental Results
• Data:
– Alarms from a telecommunication network
– 73 000 events (7 weeks), 287 event types
– Parallel and serial episodes
– Window widths (W) 10-120 seconds
– Window movement = W/10
– min_fr = 0.003 (0.3%), frequent: about 100 occurrences
– 90 MHz Pentium, 32MB memory, Linux operating system. The data resided in a 3.0 MB flat text file
WINEPI: Experimental Results
Window      Serial episodes        Parallel episodes
width (s)   #frequent  time (s)    #frequent  time (s)
10          16         31          10         8
20          31         63          17         9
40          57         117         33         14
60          87         186         56         15
80          145        271         95         21
100         245        372         139        21
120         359        478         189        22
WINEPI Approach
• One shortcoming in WINEPI approach:
– Consider that two alarms of type A and one alarm of type B occur in a window
– Does the parallel episode consisting of A and B appear once or twice?
– If once, then with which alarm of type A?
time (s):  0  10  20  30  40  50  60  70  80  90
event:         D   C   A   B   D   A   B   C
MINEPI Approach
• Alternative approach to discovery of episodes
– No sliding windows
– For each potentially interesting episode, find out the exact occurrences of the episode
• Advantages: easy to modify time limits, several time limits for one rule:
"If A and B occur within 15 seconds, then C follows within 30 seconds"
• Disadvantages: uses a lot of space
MINEPI Approach
• Formally, given an episode α and an event sequence S, the interval [ts,te] is a minimal occurrence of α in S,
– If α occurs in the window corresponding to the interval
– If α does not occur in any proper subinterval
• The set of minimal occurrences of an episode α in a given event sequence is denoted by mo(α):
mo(α) = { [ts,te] | [ts,te] is a minimal occurrence of α }
MINEPI Approach
• Example: Parallel episode β consisting of event types A and B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}, γ has one occurrence in s: {[60,80]}
[Figure: episode β (A and B in parallel) and episode γ (A and B in parallel, followed by C)]
event:         D   C   A   B   D   A   B   C
time (s):  0  10  20  30  40  50  60  70  80  90
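The minimal occurrences of the parallel episode can be verified by brute force. The sketch below assumes the events sit at 10, 20, ..., 80 seconds and handles parallel episodes only (the ordering constraint in γ is not modelled); `occurs` and `minimal_occurrences` are illustrative names, and a practical implementation would avoid enumerating all intervals.

```python
EVENTS = [(10, "D"), (20, "C"), (30, "A"), (40, "B"),
          (50, "D"), (60, "A"), (70, "B"), (80, "C")]


def occurs(episode, events, ts, te):
    """True if every event type of the parallel episode appears in [ts, te]."""
    present = {e for t, e in events if ts <= t <= te}
    return set(episode) <= present


def minimal_occurrences(episode, events):
    """Intervals [ts, te] containing the episode with no proper subinterval doing so."""
    times = sorted({t for t, _ in events})
    found = [(ts, te)
             for i, ts in enumerate(times)
             for te in times[i:]
             if occurs(episode, events, ts, te)]
    return [(ts, te) for ts, te in found
            if not any((s, e) != (ts, te) and ts <= s and e <= te
                       for s, e in found)]


print(minimal_occurrences("AB", EVENTS))
```

Only interval endpoints that coincide with event times need to be tried, because a minimal occurrence must start and end at an event of the episode.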
MINEPI Approach
• Informally, a MINEPI episode rule gives the conditional probability that a certain combination of events (alarms) occurs within some time bound, given that another combination of events (alarms) has occurred within a time bound
• Formally, an episode rule is β [win1] ⇒ γ [win2]
• β and γ are episodes such that β ⪯ γ (β is a subepisode of γ)
• If episode β has a minimal occurrence at interval [ts,te] with te - ts ≤ win1, then episode γ occurs at interval [ts,t'e] for some t'e such that t'e - ts ≤ win2
Pattern Discovery
1. Choose the language (formalism) to represent the patterns (search space)
2. Choose the rating for patterns, to tell which is “better” than others
3. Design an algorithm that finds the best patterns from the pattern class, fast.
Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.
[Figure: Levels 0 to 6 of DNA structure; Level 0: ATCGCTGAATTCCAATGTG]
Eukaryotic genome can be thought of as six levels of DNA structure.
The loops at Level 4 range from 0.5kb to 100kb in length.
If these loops were stabilized then the genes inside the loop would not be expressed.
DNA determines function (?)
DNA → Protein → Structure → Function?
• DNA: GenBank / EMBL Bank, 4 nucleotides
• Protein: SwissProt/TrEMBL, 20+ amino acids (3 nt → 1 AA)
• Structure: PDB/Molecular Structure Database
A Simple Gene
[Figure: a simple gene with upstream/promoter and downstream regions; DNA: ATCGAAATTAGCTTTA, +Modifications]
Species and individuals
• Animals, plants, fungi, bacteria, …
• Species
• Individuals
www.tolweb.org
Gene Regulatory Signal Finding
Transcription Factor
Transcription Factor Binding Site
Goal: Detect Transcription Factor Binding Sites.
Eleazar Eskin: Columbia Univ.
TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCCTTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTCTTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAATGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTCTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTGTGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACTTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTAGATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGCTTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCGAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTCTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC
Patterns: AT
Patterns: [AT][ACT]AT
IUPAC: W H AT
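The correspondence between the IUPAC notation and character groups can be illustrated with a small conversion to a regular expression. The code table below is the standard IUPAC nucleotide ambiguity alphabet; the function names and the test sequence are illustrative assumptions.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'WHAT' into a regex string."""
    return "".join(IUPAC[symbol] for symbol in pattern)


def find_all(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    lookahead = re.compile(f"(?={iupac_to_regex(pattern)})")
    return [m.start() for m in lookahead.finditer(sequence)]


print(iupac_to_regex("WHAT"))
print(find_all("WHAT", "ACATTAAT"))
```

The lookahead wrapper is used so that overlapping occurrences are reported as well.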
Cluster of co‐expressed genes, pattern discovery in regulatory regions
[Figure: expression profiles → retrieve upstream regions (600 basepairs) → find patterns over-represented within cluster]
Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000
Binomial or hypergeometric distribution
Background (ALL upstream sequences): pattern occurs in 5 out of 25, p = 0.2
Cluster: pattern occurs 3 times in 6 sequences
P(3,6,0.2) is the probability of having 3 or more matches in 6 sequences
P(3,6,0.2) = 0.0989
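The value 0.0989 can be reproduced with the binomial distribution. The sketch below assumes the score is the upper tail, i.e. the probability of 3 or more matching sequences out of 6 when each matches independently with background probability 0.2; the function names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


p_background = 5 / 25                       # pattern matches 5 of 25 sequences
print(round(binom_pmf(3, 6, p_background), 4))   # exactly 3 of 6
print(round(binom_tail(3, 6, p_background), 4))  # 3 or more of 6
```

The tail sum, not the single term for exactly 3 matches, agrees with the value on the slide.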
Pattern vs cluster “strength”
The pattern probability vs. the average silhouette for the cluster
The same for randomised clusters
Vilo et.al. ISMB 2000
The most improbable patterns from the best clusters
Pattern               Probability  Cluster size  Occurrences in cluster  Total nr of occurrences  K in K-means
AAAATTTT              2.59E-43     96            72                      830                      60
ACGCG                 6.41E-39     96            75                      1088                     50
ACGCGT                5.23E-38     94            52                      387                      40
CCTCGACTAA            5.43E-38     27            18                      23                       220
GACGCG                7.89E-31     86            40                      284                      38
TTTCGAAACTTACAAAAAT   2.08E-29     26            14                      18                       450
TTCTTGTCAAAAAGC       2.08E-29     26            14                      18                       325
ACATACTATTGTTAAT      3.81E-28     22            13                      18                       280
GATGAGATG             5.60E-28     68            24                      83                       84
TGTTTATATTGATGGA      1.90E-27     24            13                      18                       220
GATGGATTTCTTGTCAAAA   5.04E-27     18            12                      18                       500
TATAAATAGAGC          1.51E-26     27            13                      18                       300
GATTTCTTGTCAAA        3.40E-26     20            12                      18                       700
GATGGATTTCTTG         3.40E-26     20            12                      18                       875
GGTGGCAA              4.18E-26     40            20                      96                       180
TTCTTGTCAAAAAGCA      5.10E-26     29            13                      18                       250
CGAAACTTACAAA         5.10E-26     29            13                      18                       290
GAAACTTACAAAAATAAA    7.92E-26     21            12                      18                       650
TTTGTTTATATTG         1.74E-25     22            12                      18                       600
ATCAACATACTATTGT      3.62E-25     23            12                      18                       375
ATCAACATACTATTGTTA    3.62E-25     23            12                      18                       625
GAACGCGCG             4.47E-25     20            11                      13                       260
GTTAATTTCGAAAC        7.23E-25     24            12                      18                       400
GGTGGCAAAA            3.37E-24     33            14                      31                       475
ATCTTTTGTTTATATTGA    7.19E-24     19            11                      18                       675
TTTGTTTATATTGATGGA    7.19E-24     19            11                      18                       475
GTGGCAAA              1.14E-23     28            18                      137                      725
Vilo et al. ISMB 2000
GGTGGCAA - proteasome associated control element
YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1
YDL020C YDL020C RPN4 protein degradation, ubiquitin 26S proteasome subunit S0002178 1
YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1
YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1
YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1
YKL145W YKL145W RPT1 protein degradation, ubiquitin 26S proteasome subunit S0001628 1
YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1
YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1
YBR173C YBR173C UMP1 protein degradation, ubiquitin 20S proteasome maturation factor S0000377 1
YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1
YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1
YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1
YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1
YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1
YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1
YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1
YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1
YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1
YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1
YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1
YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1
YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1
YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1
YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1
YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1
YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1
YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1
YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1
YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1
YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1
YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1
YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1
YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1
YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1
YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1
YGR232W YGR232W unknown unknown S0003464 1
>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)
TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGC
101 Sequences relative to ORF startYGR128C + 100GTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_
...>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_
GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33
G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33
AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32
TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31
TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31
TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30
TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29
...
GATGAG.T
TGAAA..TTT
Sequence patterns: the basis of the SPEXS
A G A AT C GC C C
GCAT (4 positions)
GCATA (3 positions)
GCATA.C
SPEXS ‐ substrings
• enqueue( Q , Empty pattern (occurs everywhere) )
• while P = dequeue( Q )
– check all positions of P.pos
– For every character c in P.pos
• create pattern Pc
• advance all positions in Pc.pos by 1
• enqueue( Q, Pc )
SPEXS: count and memorize
i...v....x....v....x
abracadabradadabraca
a {1,4,6,8,11,13,15,18,20}
  {2,5,7,9,12,14,16,19,21}
SPEXS: extend …
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
b {2,9,16}   c {5,19}   d {7,12,14}
SPEXS: find frequent first
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
b {2,9,16}   d {7,12,14}
SPEXS: group positions
i...v....x....v....x
abracadabradadabraca
a {2,5,7,9,12,14,16,19,21}
b {2,9,16}   d {7,12,14}   [bd] {2,7,9,12,14,16}
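The position sets in the abracadabradadabraca example can be recomputed with a short sketch. Positions are 1-based as on the slides and denote the character following an occurrence of the current pattern; the minimum count of 3 used to drop the extension by c is an assumption chosen to match the figure, and the names are illustrative.

```python
from collections import defaultdict

TEXT = "abracadabradadabraca"


def extend(text, positions):
    """Group 1-based next-character positions by the character found there."""
    by_char = defaultdict(list)
    for pos in positions:
        if pos <= len(text):          # len(text) + 1 points past the end
            by_char[text[pos - 1]].append(pos)
    return dict(by_char)


# Positions immediately after each occurrence of the pattern "a".
after_a = [i + 2 for i, ch in enumerate(TEXT) if ch == "a"]

children = extend(TEXT, after_a)                      # extensions ab, ac, ad
frequent = {c: p for c, p in children.items() if len(p) >= 3}
group_bd = sorted(frequent["b"] + frequent["d"])      # character group a[bd]

print(after_a)
print(children)
print(group_bd)
```

A character group is obtained by merging the position lists of its members, so no new scan of the text is needed.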
The wildcards
GCAT.{3,6}X
The wildcards
GCAT.*X
The wildcards: not too many
w:0
aw:0
.{3.6}b1w:1w:0
SPEXS: general algorithm
1. S = input sequences ( ||S|| = n )
2. e = empty pattern, e.pos = {1,...,n}
3. enqueue( order, e )
4. while p = dequeue( order )
5. generate all allowed extensions p’ of p (& p’.pos)
6. enqueue( order, p’, priority(p’) )
7. enqueue( output, p’, fitness(p’) )
8. while p = dequeue( output )
9. Output p
Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (pp. 20) May 1998. Department of Computer Science, University of Helsinki.
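A minimal sketch of the general scheme, restricted to plain substring patterns, is given below. It is not the SPEXS implementation: priority and fitness are both taken to be the number of sequences containing the pattern, extensions below a minimum support are discarded, and a maximum pattern length bounds the search. All names and parameters are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def spexs_substrings(sequences, min_support, max_len=6):
    """Return (pattern, support) pairs, best support first.

    Each pattern carries the list of (sequence index, offset of next character)
    for all of its occurrences, so extending it never rescans the sequences.
    """
    empty_pos = [(i, j) for i, s in enumerate(sequences) for j in range(len(s))]
    order = [(0, "", empty_pos)]       # heap of (priority, pattern, positions)
    output = []                        # collected (-fitness, pattern)
    while order:
        _, pattern, positions = heapq.heappop(order)
        if len(pattern) == max_len:
            continue
        # Generate all one-character extensions and their position lists.
        by_char = defaultdict(list)
        for i, j in positions:
            if j < len(sequences[i]):
                by_char[sequences[i][j]].append((i, j + 1))
        for ch, new_pos in by_char.items():
            support = len({i for i, _ in new_pos})
            if support >= min_support:
                child = pattern + ch
                heapq.heappush(order, (-support, child, new_pos))
                output.append((-support, child))
    return [(p, -s) for s, p in sorted(output)]


if __name__ == "__main__":
    seqs = ["GATGAGCT", "TTGATGAG", "CCGATGAGA"]
    for pattern, support in spexs_substrings(seqs, min_support=3):
        print(pattern, support)
```

Because an extension can never occur in more sequences than its parent, pruning on support does not lose any frequent pattern.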
Applications in bioinformatics:
-Gene regulation (1998: 255+ citations, 2000: 73 cit)
Jaak Vilo: Pattern Discovery from Biosequences PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3 Helsinki, November 2002, 149 pages
-Functional elements in proteins (2002: 32 cit)
SPEXS ‐ Sequence Pattern EXhaustive Search
Jaak Vilo, 1998, 2002
• User‐definable pattern language: substrings, character groups, wildcards, flexible wildcards (cf. PROSITE)
• Fast exhaustive search over pattern language
• “Lazy suffix tree construction”‐like algorithm (Kurtz, Giegerich)
• Analyze multiple sets of sequences simultaneously
• Restrict search to most frequent patterns only (in each set)
• Report most frequent patterns, patterns over‐ or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
Multiple data sets
D1 D2 D3
4/3 (6) 3/3 (12) 2/2 (9)
.G.GATGAG.T. 39 seq (vs 193) p= 2.5e-33
-1: .G.GATGAG.T. 61 seq (vs 1292) p= 1.4e-19
-2: .G.GATGAG.T. 91 seq (vs 5464)
-3: .G.GATGAG.T. 98 seq
Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Series of Publications A, Report A-2002-3, Helsinki, November 2002, 149 pages
-2: .G.GATGAG.T. 91 seq
These hits result in a PWM:
PWM based on all previous hits, here shown highest-scoring occurrences in blue
All against all approximate matching
For every subsequence of every sequence, match approximately against all the sequences.
Approximate hits define PWM matrices (not all positions vary equally).
Look for ALL PWM-s derived from data that are enriched in data set (vs. background).
Hendrik Nigul, Jaak Vilo
Dynamic programming
• A small number of edit operations allows the search to be limited efficiently to the area around the main diagonal
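The diagonal restriction can be sketched as a banded edit distance: if at most k edit operations are allowed, only cells with |i - j| ≤ k of the dynamic programming table can lie on a valid alignment path. The function below is an illustrative sketch (unit costs for insertion, deletion and substitution are assumed); it returns the distance if it is at most k and None otherwise.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is at most k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                      # the end cell lies outside the band
    inf = k + 1                          # any value above k means "too far"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        lo, hi = max(0, i - k), min(len(b), i + k)
        if lo == 0:
            cur[0] = i                   # delete the first i characters of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,   # match or substitution
                         prev[j] + 1,          # deletion from a
                         cur[j - 1] + 1,       # insertion into a
                         inf)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


print(banded_edit_distance("GATGAGT", "GATCAGT", 1))
print(banded_edit_distance("kitten", "sitting", 3))
print(banded_edit_distance("kitten", "sitting", 2))
```

Only about 2k + 1 cells per row are evaluated, so the work is proportional to k times the sequence length instead of the full table; for brevity this sketch still allocates a full row.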
Suffix Tree
AC G T
G
G
T
{1:24,2:12,2:23…}
Trie based all against all approximate matching
• trieindex
• trieagrep
• trieallagrep
• triematrix
Hendrik Nigul, Jaak Vilo