23/4/19
ACM SIGMOD 2007
Effective Variation Management for Pseudo Periodical Streams
Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan Miao, Dongqing Yang, Xinbiao Zhou
School of EECS
Peking University

Summary
Introduction
Related Work
Variation Management for Pseudo Periodical Stream
Experiments
Conclusion

Pseudo Periodical Stream
Pseudo periodical stream
  Data seems to repeat with a certain period
  Tiny variations exist between different periods
  Common in domains such as medicine and seismology
Typical stream variations: gradual evolutions rather than burst changes

An Example of Pseudo Periodical Stream
The respiratory data repeats about every 3.2 seconds
It reflects the evolution of the patient’s illness over five hours

Variation Management on Data Stream
Data streams are widely applied in many domains
  Stock market analysis
  Road traffic control
  Medical signal processing
Online variation management is an important task
  When did the variation occur? (Detect variations)
  What is the variation? How does it change? (Describe variations)
  Why does it change in this way? (Help understand variations)

Major Technical Challenges
Value type
  Traditional algorithms: discrete (enumerated) values or time series (equidistant intervals)
  Data stream: continuous real numbers with variable sampling frequencies
Training sets or models
  Traditional algorithms rely on several training sets or predefined models
  The data stream evolves, so such models may soon stop working
  On the contrary, the system is required to generate such models as output

Major Technical Challenges II
Variation type
  Variations appear not only in abnormal values and distributions
  But also in the structure within a period (shape)
Noise: unpredictable and random
In many applications, the variations are monitored manually
Our contribution: a new method named Pattern Growth Graph (PGG) to detect and store variations over pseudo periodical streams

Data Stream Management Systems
Data stream work can be loosely classified into two categories: DSMS and online data mining
Data Stream Management Systems (DSMS)
  Examples: STREAM, Aurora, TelegraphCQ…
  Mainly focus on answering predefined SQL queries
  Do not try to find the data features or to monitor the variations

Online data mining
Variation management is an important part of online data mining
Three classes according to the algorithms
  Symbolic approaches
  Mathematical transformations
  Predefined models
Symbolic approaches: Tarzan and SAX
  Space: the entire time series/data stream must be kept in memory
  SAX has poor precision

Mathematical Transformation
Mathematical transformation: Discrete Wavelet Transform (DWT) and Fast Fourier Transform (FFT)
  Require a fixed data length as well as a fixed sampling frequency (equidistant intervals)
  The Haar wavelet transform can only be performed on 2^n data items, e.g., the data length must be 1024 or 2048
Predefined models: using Zigzag to detect events in financial streams (SIGMOD 04)
  Too domain specific
  Users cannot provide such models in advance; they would actually like them as the output
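The power-of-two restriction on the Haar transform can be seen in a minimal sketch. This is a textbook-style unnormalized Haar transform (pairwise averages and half-differences), written only for illustration; the function name `haar_transform` is an assumption and it is not the implementation used in the experiments.

```python
from typing import List, Sequence


def haar_transform(values: Sequence[float]) -> List[float]:
    """Unnormalized Haar transform: repeated pairwise averages and half-differences."""
    n = len(values)
    # Every level halves the data, so the length must be a power of two.
    if n == 0 or n & (n - 1):
        raise ValueError(f"length {n} is not a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out


print(haar_transform([4, 6, 10, 12]))
try:
    haar_transform([4, 6, 10])  # three items cannot be paired at every level
except ValueError as err:
    print("rejected:", err)
```

A stream with variable sampling frequency rarely delivers exactly 2^n items per period, which is why this requirement is a practical obstacle.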

Task Specification by Respiration Stream
Variation: detect the stream variations online in one pass
Wave: the smallest unit of concern is not a single point but the values in a certain period, represented as a wave
Alarms: section F is actually noise caused by body movements
Summary: a summary with an acceptable error bound is very helpful
[Figure: respiration stream plotted against time (s), with sections labeled A to H]

System Framework
[Figure: system framework. Data Stream -> Wave Splitting -> Wave Stream -> Wave-Pattern Matching -> Pattern Growth Graph -> Output. Matching outcomes: fully matched (increase frequency), partially matched (grow pattern), unmatched (new pattern). Online variation management: online update, stream view, pattern evolutions, send alarms.]

Wave Splitting I
Variation: the difference from old data
Detected by comparing the old data with the incoming stream
Comparing at every incoming item wastes too many resources
Comparing once per wave is much more efficient
How to divide the stream according to the data features?

Wave Splitting II
A fixed-length window accumulates error
Observation: waves start and end at valley points that are smaller than a certain value

Upper Bound of Valley Points
Initially defined by the user
Updated with the average value of past valley points
$U_b = \left( \sum_{i=1}^{N} V_i \right) / N$, where $V_i$ are the past valley points and $N$ is their number

Valley Sections
Valley section: an approximately flat section that represents the time interval between two events
It is also worth studying as one part of the wave
Take the last point of the section as the cut point
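The splitting rules on the last three slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the "valley point" of a section is taken to be its lowest value, and `margin` is a hypothetical slack added on top of the running average (with `margin=0.0` the bound is exactly the average of past valley points). The names `split_waves`, `initial_bound` and `margin` are illustrative.

```python
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def split_waves(stream: Iterable[Point], initial_bound: float,
                margin: float = 0.0) -> Tuple[List[List[Point]], float]:
    """Cut the stream at the last point of every valley section.

    Returns the completed waves and the final upper bound. The first chunk
    may be a partial wave, and the unfinished tail is not returned.
    """
    waves: List[List[Point]] = []
    current: List[Point] = []
    valley_minima: List[float] = []      # assumed valley point of each past section
    bound = initial_bound                # user-defined starting value
    section_min: Optional[float] = None  # None while outside a valley section
    for t, v in stream:
        in_valley = v <= bound
        if not in_valley and section_min is not None:
            # The previous point was the last one of a valley section: cut there.
            waves.append(current)
            current = [current[-1]]      # the cut point also starts the next wave
            valley_minima.append(section_min)
            bound = sum(valley_minima) / len(valley_minima) + margin
            section_min = None
        if in_valley:
            section_min = v if section_min is None else min(section_min, v)
        current.append((t, v))
    return waves, bound


values = [0.5, 0.4, 5, 10, 5, 0.6, 0.5, 0.7, 6, 11, 4, 0.3, 0.4, 7, 9]
waves, bound = split_waves(enumerate(values), initial_bound=1.0, margin=0.5)
for wave in waves:
    print(wave[0], "->", wave[-1], f"({len(wave)} points)")
print("final bound:", round(bound, 3))
```

Because the cut follows the data rather than a fixed window length, small drifts in the period do not accumulate from one wave to the next.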

Two Problems in Online Matching I
Problem 1: the data stream’s sampling frequency is usually high (>100 Hz), so waves should be simplified
Problem 2: how to compare two waves that have different time lengths and may not have data at the same time points?
A: {(10, 0.5), (20, 1.0), (25, 1.3), … (90, 50.5)} 22 data items
B: {(11, 0.5), (25, 1.2), (30, 1.7), … (87, 50)} 20 data items

Two Problems in Online Matching II
Solution 1: Piecewise Linear Representation (PLR)
This makes Problem 2 more difficult: patterns are simplified into segments, so how to compare segments with points?
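To make the idea of PLR concrete, here is a minimal sketch of one common way to compute it, a greedy sliding-window approximation. The slides do not say which PLR algorithm is used, so this choice, the function names `plr` and `_fits`, and the `max_error` parameter are assumptions. Timestamps are assumed to be strictly increasing.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def _fits(points: List[Point], start: int, end: int, max_error: float) -> bool:
    """True if the straight line from points[start] to points[end] stays
    within max_error of every point in between."""
    (t0, v0), (t1, v1) = points[start], points[end]
    for t, v in points[start + 1:end]:
        estimate = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        if abs(v - estimate) > max_error:
            return False
    return True


def plr(points: List[Point], max_error: float) -> List[Point]:
    """Greedy sliding-window PLR: extend a segment until the error bound breaks."""
    if len(points) < 2:
        return list(points)
    breakpoints = [points[0]]
    start = 0
    for end in range(2, len(points)):
        if not _fits(points, start, end, max_error):
            # The segment ending at the previous point was the longest valid one.
            breakpoints.append(points[end - 1])
            start = end - 1
    breakpoints.append(points[-1])
    return breakpoints


wave = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 2), (5, 1), (6, 0)]
print(plr(wave, max_error=0.1))
```

Seven sampled points collapse to three breakpoints (two segments), which is the kind of reduction that addresses Problem 1.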

Wave-pattern Matching
In real applications, two sequences are assumed to match if their paths roughly coincide
PLR segments record the paths of old data
Test whether the incoming stream items lie on those paths
The intensity of a variation can be determined by the number of matching items
[Figure: two ECG plots over time (s), 0 to 1.5 s]
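The test "are the incoming items on the path?" can be sketched by interpolating the stored PLR segments at each item's timestamp. This is an illustrative sketch only: the `tolerance` parameter, aligning the wave to the pattern by its first timestamp, and extending the last segment for items past the pattern's end are all assumptions, and the names `path_value` and `match_ratio` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Point = Tuple[float, float]  # (time offset or timestamp, value)


def path_value(breakpoints: List[Point], t: float) -> float:
    """Value of the piecewise linear path at offset t (needs >= 2 breakpoints)."""
    times = [bt for bt, _ in breakpoints]
    i = bisect_right(times, t) - 1
    i = max(0, min(i, len(breakpoints) - 2))  # clamp to a valid segment
    (t0, v0), (t1, v1) = breakpoints[i], breakpoints[i + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def match_ratio(breakpoints: List[Point], wave: List[Point],
                tolerance: float) -> float:
    """Fraction of wave items that lie within tolerance of the pattern's path."""
    t_start = wave[0][0]
    hits = sum(
        1 for t, v in wave
        if abs(v - path_value(breakpoints, t - t_start)) <= tolerance
    )
    return hits / len(wave)


pattern = [(0, 0), (3, 3), (6, 0)]  # PLR breakpoints of an old wave
incoming = [(10, 0.1), (11.5, 1.4), (13, 3.0), (14.5, 1.6), (16, 2.0)]
print(match_ratio(pattern, incoming, tolerance=0.2))
```

Since only the path is evaluated, the incoming wave does not need samples at the same time points as the old data. A ratio near 1 suggests a full match, while a lower ratio indicates a stronger variation.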

Record the Patterns
Observation: many patterns have only a few segments changed
Most stream variations are gradual evolutions rather than burst mutations
Recording patterns in a simple list not only ignores their relationships but also causes storage redundancy
Utilize the similarity among patterns and reuse the unchanged parts
Pattern Growth Graph (PGG) is designed to store patterns and the variation history

Pattern Growth Graph
Implemented as a bi-directional linked list
New segments are generated only for the unmatched data
New patterns seem to grow from the old ones
[Figure: a wave plotted with Pattern 1 and Pattern 2. Pattern 1 (base pattern) runs from Start through segments 1 to 8 to End; Pattern 2 (growth pattern) adds new segments 1', 2', 3', 4'.]

Construct Full Wave-pattern
New problem: wave-pattern matching needs the full pattern to compare against, while PGG stores only the new parts
Fortunately, the full pattern can be constructed by propagating the pointers
[Figure and table: Pattern 1 (segments 1 to 9), Pattern 2 (1', 2', 3') and Pattern 3 (1", 2"), with the full pattern built up through Step 0, Step 1, Step 2 (a collision at segment 8) and Final. Table columns: Pattern, Left 1, Right 1, Left 2, Right 2.]
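The idea of storing only the new segments and rebuilding the full pattern through pointers can be illustrated with a deliberately simplified model. This is not the PGG linked-list structure itself: the `Pattern` class, and the `reuse_prefix` and `rejoin_at` fields that stand in for the left and right pointers, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pattern:
    name: str
    own_segments: List[str]             # only the segments new to this pattern
    parent: Optional["Pattern"] = None  # pattern this one grew from
    reuse_prefix: int = 0  # how many leading segments of the parent are reused
    rejoin_at: int = 0     # index in the parent's full pattern where it rejoins

    def full(self) -> List[str]:
        """Rebuild the full pattern by following parent pointers."""
        if self.parent is None:
            return list(self.own_segments)
        base = self.parent.full()
        return base[:self.reuse_prefix] + self.own_segments + base[self.rejoin_at:]


base = Pattern("Pattern 1", ["1", "2", "3", "4", "5", "6", "7", "8"])
growth = Pattern("Pattern 2", ["1'", "2'", "3'", "4'"],
                 parent=base, reuse_prefix=2, rejoin_at=6)
third = Pattern("Pattern 3", ['1"', '2"'],
                parent=growth, reuse_prefix=3, rejoin_at=5)

for p in (base, growth, third):
    print(p.name, p.full())

stored = sum(len(p.own_segments) for p in (base, growth, third))
expanded = sum(len(p.full()) for p in (base, growth, third))
print(f"stored segments: {stored}, segments if each pattern were stored whole: {expanded}")
```

In this toy case 14 segments are stored instead of 24, and each growth pattern keeps a link to the pattern it evolved from.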

Problems with PGG Size
Number of waves in the data stream: n; PGG size (number of patterns): k
The time complexity of the PGG-based matching algorithm is O(k*n)
In the worst case, each incoming wave introduces a new pattern, so the overall time cost is O(n^2)
As the PGG grows larger, the algorithm becomes time-consuming
PGG is not allowed to use “forgetting functions”
  It is hard to delete from a PGG
  Some uncommon patterns may have higher domain significance
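The worst-case bound can be checked with a short count. Assume, for illustration, that wave $i$ is compared against every pattern stored so far and that each of the $i-1$ earlier waves introduced a new pattern:

```latex
\sum_{i=1}^{n} (i - 1) \;=\; \frac{n(n-1)}{2} \;=\; O(n^2)
```

When the number of patterns stays bounded by $k$, the same count gives at most $k$ comparisons per wave, that is $O(k \cdot n)$ overall.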

Rank the Patterns
Observation: the most frequent pattern and the patterns similar to it have the highest probability of matching the incoming wave
Matching probability factor
Patterns with smaller probability are not deleted, but have lower priority when compared
When a pattern gets a match, the system increases not only its own rank but also that of its “family”
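A minimal sketch of rank-ordered matching follows. The scoring rule is an assumption made for illustration (a match adds 1 to the pattern and a hypothetical `family_bonus` of 0.5 to each related pattern); the slides do not give the actual matching probability factor. The class and method names are also illustrative.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class RankedPatterns:
    """Keeps every pattern, but tries the highest-ranked ones first."""

    def __init__(self, family_bonus: float = 0.5) -> None:
        self.rank: Dict[str, float] = {}
        self.family: Dict[str, List[str]] = {}
        self.family_bonus = family_bonus  # assumed share given to relatives

    def add(self, name: str, relatives: Iterable[str] = ()) -> None:
        """Register a pattern; relatives must already be registered."""
        self.rank[name] = 0.0
        self.family[name] = list(relatives)
        for r in self.family[name]:
            self.family[r].append(name)

    def match(self, wave: object,
              is_match: Callable[[str, object], bool]) -> Tuple[Optional[str], int]:
        """Return the matching pattern (or None) and the number of comparisons."""
        order = sorted(self.rank, key=lambda n: self.rank[n], reverse=True)
        for tried, name in enumerate(order, start=1):
            if is_match(name, wave):
                self.rank[name] += 1.0
                for r in self.family[name]:
                    self.rank[r] += self.family_bonus
                return name, tried
        return None, len(order)


patterns = RankedPatterns()
patterns.add("A")
patterns.add("B", relatives=["A"])  # B grew from A
patterns.add("C")

same_name = lambda name, wave: name == wave  # stand-in for wave-pattern matching
print(patterns.match("B", same_name))  # B is tried second at first
print(patterns.match("B", same_name))  # after one match, B is tried first
```

Nothing is deleted, so rare but significant patterns remain available; they are simply compared later.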

Reconstruct the Stream View with PGG
Queries on a traditional DSMS
  Predefined, and hard to answer once the data items have passed by
Example query: “the patient's ECG in the past five hours”
Record the occurrence times of all patterns in PGG
Reconstruct the stream view from the PGG patterns
This consumes only about 4% of the storage space of the original stream, yet provides an approximate stream view within a 5% relative error bound
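Reconstruction from patterns plus occurrence times can be sketched as a lookup followed by interpolation. The storage layout here (a dictionary of PLR breakpoints per pattern and a time-sorted list of `(start_time, pattern_name)` occurrences) is an assumption for illustration, as are the function names.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (offset within the wave, value)


def interpolate(breakpoints: List[Point], t: float) -> float:
    """Value of a piecewise linear pattern at offset t."""
    times = [bt for bt, _ in breakpoints]
    i = bisect_right(times, t) - 1
    i = max(0, min(i, len(breakpoints) - 2))  # clamp to a valid segment
    (t0, v0), (t1, v1) = breakpoints[i], breakpoints[i + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def stream_value(patterns: Dict[str, List[Point]],
                 occurrences: List[Tuple[float, str]], t: float) -> float:
    """Approximate stream value at time t from the recorded occurrences."""
    starts = [start for start, _ in occurrences]
    i = bisect_right(starts, t) - 1
    if i < 0:
        raise ValueError("time precedes the first recorded wave")
    start, name = occurrences[i]
    return interpolate(patterns[name], t - start)


patterns = {
    "P1": [(0, 0), (1, 10), (3, 0)],
    "P2": [(0, 0), (1.5, 6), (3, 0)],
}
occurrences = [(0.0, "P1"), (3.0, "P1"), (6.0, "P2")]  # sorted by start time

for t in (0.5, 4.0, 6.75):
    print(t, stream_value(patterns, occurrences, t))
```

Only the breakpoints and the occurrence list are kept, so the approximation error depends on how closely each wave followed the pattern it was matched to.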

Track Pattern Evolution
To answer “Why does it change in this way?”
When the user selects a pattern of interest, PGG can trace its source

False Alarm
A successful system needs to reduce the false alarms introduced by noise
The major problem: noise comes from many sources, takes various forms and is hard to model

Noise Recognition
A shortcut: consider the pattern’s evolution history
Some strategies to reduce false alarms on medical streams:
Unusual values in growth patterns: the patient’s condition has worsened -- Warning
A new pattern that matches successive waves: the underlying pathological mechanism might have changed fundamentally -- Warning
A series of new patterns that match neither the previous nor the following waves -- suspected to be noise
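The last two strategies can be sketched as a rule over the sequence of patterns matched by successive waves. This is an illustrative simplification: the `min_repeats` threshold is a hypothetical parameter, the first strategy (unusual values in growth patterns) is not covered, and the function name is an assumption.

```python
from itertools import groupby
from typing import List, Set, Tuple


def classify_new_patterns(matched: List[str], known: Set[str],
                          min_repeats: int = 3) -> List[Tuple[str, int, str]]:
    """Label each run of a previously unknown pattern.

    matched: the pattern id matched (or created) by each successive wave.
    known:   pattern ids that existed before this window.
    A new pattern that keeps matching successive waves raises a warning;
    new patterns that do not repeat are suspected to be noise.
    """
    verdicts = []
    for pattern, run in groupby(matched):
        length = len(list(run))
        if pattern in known:
            continue
        verdict = "warning" if length >= min_repeats else "suspected noise"
        verdicts.append((pattern, length, verdict))
    return verdicts


history = ["P1", "P1", "N1", "N2", "N3", "P1", "P2", "P2", "P2", "P2"]
for row in classify_new_patterns(history, known={"P1"}):
    print(row)
```

Here N1, N2 and N3 each appear once between waves that match the known pattern, so they are treated as suspected noise, while P2 persists and triggers a warning.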

Experimental Setup
Data sets
  Medical streams: six real pathology signals including ECG, respiration... (over 25,000,000 data points)
  Earthquake waves: the Pacific earthquake wave data from the NGA project (100,000 data points)
  Sunspot data: all the sunspot records between 1850 and 2001 (55,000 data points)
Environment: Intel Pentium 4 3.0 GHz CPU with 1 GB RAM, Windows XP Professional, JDK 1.5.0…

Effect of Rank Function
At the beginning, the effect is insignificant
After three million data points, the naive algorithm’s performance degrades rapidly
In the end, the rank algorithm outperforms the naive one by about 300%

Reconstruct the Stream View
The ECG data stream (more than 10M data items) can be represented with only 420 patterns
This compression result is due to two factors
  PLR simplification reduces the size of the patterns to about 20%
  PGG further reduces it to about 3.31% by compressing the repeating and similar patterns (the patterns need only about 0.3%; the remaining 3% stores the occurrence times of the patterns)

Comparison with Other Methods
PGG is compared with SAX (symbolic approach), the discrete Haar wavelet transform (mathematical transformation) and Zigzag (predefined models)
The average processing efficiency is 60K to 70K items/sec
This is much higher than real applications need

Variation Detection & Noise Recognition
Two important measurements:
  Sensitivity (high positive rate): the algorithm sends alarms at meaningful variations
  Selectivity (low negative rate): the algorithm does not send false alarms on noise
The two measurements conflict
  Increasing sensitivity to find more variations will inevitably cause more false alarms
In a medical environment, sensitivity is much more important -- missing a meaningful variation may cost the patient’s life
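The two measurements can be computed from labeled sections as in the sketch below. The exact formulas are an assumption, since the slides describe them only in words: sensitivity is taken as the fraction of meaningful variations that raised an alarm, and selectivity as the fraction of noise sections that raised none.

```python
from typing import List, Tuple


def sensitivity_selectivity(is_variation: List[bool],
                            alarmed: List[bool]) -> Tuple[float, float]:
    """Both lists have one entry per labeled section of the stream.

    Needs at least one meaningful variation and one noise section,
    otherwise a ratio would be undefined.
    """
    pairs = list(zip(is_variation, alarmed))
    tp = sum(1 for v, a in pairs if v and a)          # variation, alarm sent
    fn = sum(1 for v, a in pairs if v and not a)      # variation, missed
    tn = sum(1 for v, a in pairs if not v and not a)  # noise, correctly silent
    fp = sum(1 for v, a in pairs if not v and a)      # noise, false alarm
    return tp / (tp + fn), tn / (tn + fp)


truth = [True, True, True, True, False, False, False, False, False]
alarms = [True, True, True, False, True, False, False, False, False]
sens, sel = sensitivity_selectivity(truth, alarms)
print(f"sensitivity={sens:.2f} selectivity={sel:.2f}")
```

Lowering the matching tolerance would typically turn some missed variations into alarms but also some silent noise sections into false alarms, which is the conflict described above.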

Best Results of Sensitivity on Respiration Stream
Zigzag sends a false alarm at almost every noise section
DWT and SAX can hardly distinguish real variations from noise

Results of Noise Recognition on Other Streams
For the other streams, precision is taken as the main measurement
PGG performs accurately and stably
Zigzag is volatile across datasets:
  Good on three blood pressure signals (ABP, CVP and ICP, where meaningful variations are outliers)
  Poor on PLETH (where meaningful variations lie in the inner structure)

Discussion
Zigzag: focuses on extreme data points, strongly influenced by outliers
SAX: good at finding variations over a long period using frequency statistics -- more suitable for time series
DWT: only effective for signals with strict periods
With its effective data structure, PGG discovers and records as many features of the data stream as possible
The recorded information helps distinguish meaningful variations from noise

Conclusion
Streams are split into waves and represented by PLR patterns
Variations are detected by online wave-pattern matching
Pattern Growth Graph stores the variation history
The stream view is reconstructed with high accuracy
Meaningful variations are effectively distinguished from noise

Future Work
Extend PGG to multiple streams
Apply the PGG method in other application domains such as weather forecasting and financial analysis
Combine with other methods, like Zigzag…