Time Series Data Analysis - II
Yaji Sripada
Dept. of Computing Science, University of Aberdeen
In this lecture you learn
• Structural representations of time series
• SAX
  – Computing SAX
  – Data analysis using SAX
  – Visualization using SAX
Introduction
• Time series exhibit an internal structure
  – Elements of this structure have domain-specific meanings
  – E.g. the spikes in the gas turbine data (from the last lecture) have a domain-specific meaning
  – The structural elements of a time series are usually approximations (abstractions) of the original data
  – Experts in any domain reason in terms of these abstractions and not in terms of the original time series
  – Understanding time series = understanding their structure
Several structural representations
• Time series can be represented in terms of
  – Linear segments (we already saw this last week)
  – Aggregate approximations (studied in this lecture)
  – Non-linear segments (not in this course)
  – Wavelets (involve complex mathematics; not in this course)
  – And many more
• The primary motivation behind creating the above structural representations is time series data mining
Which structure is the most useful?
• All these structural representations are useful
  – Some may be used more in certain application domains than in others
• A good representation exhibits meaningful structure
  – But meaning is attributed to a structure based on domain knowledge and user tasks
• This means: select a representation that makes the meaning easy to compute
• Our approach to selecting the right representation
  – Based on the domain KA (knowledge acquisition) we learn the trends and patterns that are meaningful
  – Select one or more representations that facilitate the computation of the required trends and patterns
Symbolic Aggregate Approximation (SAX)
• A recently developed symbolic representation of time series is claimed to facilitate easy pattern computation
• http://www.cs.ucr.edu/~eamonn/SAX.htm is the main SAX page
• We introduced this representation in the last lecture
• We study how to create this representation in this lecture because it allows
  – Novel data analysis of time series and
  – Novel visualization of time series
• We will study briefly data analysis and visualization with SAX
• The above link has all the required details for further study
Creating SAX
• Input
  – Real-valued time series (blue curve)
• Output
  – Symbolic representation of the input time series (red string)
• Process
  – First convert the input series into a piecewise aggregate approximation (PAA) representation (grey steps)
  – Then convert the PAA into a string of symbols (red string)
[Figure: the input series (blue curve), its PAA (grey steps), and the resulting SAX string baabccbc (red)]
Example Data
Time Depth
20 4.2
40 9.2
60 14.8
80 15
100 17
120 18
140 19.7
160 20
180 20.8
200 21.3
220 21.6
240 20.6
260 16.9
280 12.8
Creating PAA
• Normalize the input time series
  – Subtract the mean from each value and divide the result by the standard deviation
• Divide the input time series of length n into w portions of equal length
  – w is the parameter that controls the length of the PAA and therefore the length of the SAX
  – If w is large you have a detailed (fine) PAA and a detailed SAX
  – If w is small you have an abstract (coarse) PAA and an abstract SAX
  – The choice of w should be based on the application requirements
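The normalization step can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the name `z_normalize` is made up for illustration, and it assumes the sample standard deviation (dividing by n - 1), which the slide does not specify. The `depth` list holds the Depth column of the example data.

```python
from statistics import mean, stdev

# Depth values from the Example Data slide
depth = [4.2, 9.2, 14.8, 15, 17, 18, 19.7, 20, 20.8, 21.3, 21.6, 20.6, 16.9, 12.8]

def z_normalize(series):
    """Subtract the mean from each value and divide by the standard deviation."""
    mu = mean(series)
    sigma = stdev(series)  # assumption: sample standard deviation (n - 1)
    return [(x - mu) / sigma for x in series]

normalized = z_normalize(depth)
print([round(v, 3) for v in normalized])
```

After this step the series has mean 0 and standard deviation 1, which is what later allows the normal distribution to be used for choosing symbols.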
Creating PAA (2)
• Two cases
  – n/w is a whole number
    • Simple case: each portion has n/w values from the input time series
  – n/w is a fraction
    • Complicated case, because you cannot assign an equal whole number of values from the input series to w equal-sized portions
• Our example data has n = 14
• If w = 3, then n/w is a fraction
• The length of each portion is 14/3 = 4.6667
• Each portion should have 4.6667 values from the original time series
Creating PAA (3)
• We use the following scheme to achieve 4.6667 values in each portion
• The following is the list of indexes of the 14 values in the input series
1 2 3 4 5 6 7 8 9 10 11 12 13 14
• The first portion will have the values at 1, 2, 3, and 4
• We need 0.6667 more to complete this portion
• We achieve this by inserting 0.6667 times the 5th value
• The remaining 0.3333 times the 5th value is inserted into the second portion
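The splitting scheme above can be made concrete with a short Python sketch. The function name `portion_weights` is hypothetical; it treats the i-th value as occupying the interval from i - 1 to i and measures how much of that interval falls inside each portion. Exact fractions are used so that 2/3 and 1/3 appear without rounding.

```python
from fractions import Fraction

def portion_weights(n, w):
    """For each of the w portions, list (index, weight) pairs; indexes are 1-based."""
    width = Fraction(n, w)  # values per portion, e.g. 14/3
    portions = []
    for k in range(w):
        start, end = k * width, (k + 1) * width
        pairs = []
        for i in range(n):
            # Overlap between value i's unit interval [i, i + 1) and the portion
            overlap = min(end, i + 1) - max(start, i)
            if overlap > 0:
                pairs.append((i + 1, overlap))
        portions.append(pairs)
    return portions

for pairs in portion_weights(14, 3):
    print([(index, str(weight)) for index, weight in pairs])
```

For n = 14 and w = 3 the first portion ends with index 5 at weight 2/3 and the second portion starts with index 5 at weight 1/3, matching the scheme described above.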
Creating PAA (4)
• Using the above scheme our three lists are
  – 4.2, 9.2, 14.8, 15 and 0.6667*17
  – 0.3333*17, 18, 19.7, 20, 20.8, 0.3333*21.3
  – 0.6667*21.3, 21.6, 20.6, 16.9, 12.8
• (Note: here we have shown the values from the un-normalized input series)
• Each of the above sublists has an equal portion of the input series
• Next, for each of the sublists compute the average (mean)
• In our case, the three sublists will each have an average value
• PAA is simply a vector of these average values
  – {avg1, avg2, avg3}
  – {-0.9338, 0.53135, 0.34767} for our example (using normalized values)
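One way to compute these averages is sketched below in Python. It is an illustration under stated assumptions rather than the lecture's implementation: the function name `paa` is made up, and each sublist's weighted sum is assumed to be divided by n/w (4.6667 here). Repeating every value w times and cutting the result into w blocks of n values reproduces the fractional weights exactly. The example runs on the un-normalized Depth values, as the lists above do.

```python
def paa(series, w):
    """Piecewise aggregate approximation; also works when len(series) / w is a fraction."""
    n = len(series)
    # Repeat every value w times. Each block of n repeated values then covers
    # exactly n / w original values, including the fractional parts.
    stretched = [x for x in series for _ in range(w)]
    return [sum(stretched[k * n:(k + 1) * n]) / n for k in range(w)]

# Depth values from the Example Data slide (un-normalized)
depth = [4.2, 9.2, 14.8, 15, 17, 18, 19.7, 20, 20.8, 21.3, 21.6, 20.6, 16.9, 12.8]
print([round(v, 4) for v in paa(depth, 3)])
```

In practice the same function would be applied to the normalized series before the symbol mapping step.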
Properties of PAA
• PAA is simple to compute (as can be seen from the previous slides)
• Achieves dimensionality reduction
  – From 14 values our input series is reduced to 3 values
• A distance computed between two PAAs never exceeds the distance between the corresponding input series
  – Lower bounding distance
  – Very useful property for a structural representation
  – Allows data analysis to be performed on the approximate representation rather than the original series
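The lower bounding property can be checked numerically with a small Python sketch. Several things here are assumptions not stated on the slide: Euclidean distance is used, n/w is a whole number, the PAA distance carries the usual scaling factor sqrt(n/w), and the two series are made-up values for illustration only.

```python
from math import sqrt

def paa(series, w):
    """PAA for the simple case where len(series) / w is a whole number."""
    m = len(series) // w
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(w)]

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def paa_distance(x, y, w):
    """Distance between the two PAAs, scaled by sqrt(n / w)."""
    n = len(x)
    return sqrt(n / w) * euclidean(paa(x, w), paa(y, w))

# Two made-up series of length 8 (illustrative values only)
x = [0.1, 0.5, 0.9, 1.2, 0.8, 0.3, -0.4, -1.1]
y = [0.0, 0.2, 0.4, 1.0, 1.3, 0.9, 0.1, -0.6]

print(paa_distance(x, y, 4) <= euclidean(x, y))  # True
```

Because the PAA distance never overestimates the true distance, a search can discard candidates using the short representation without missing any true match.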
Symbol Mapping
• In this step, each average value from the PAA vector is replaced by a symbol from an alphabet
• An alphabet size, a, of 5 to 8 is recommended
  – a,b,c,d,e
  – a,b,c,d,e,f
  – a,b,c,d,e,f,g
  – a,b,c,d,e,f,g,h
• Given an average value we need a symbol
• This is achieved by using the normal distribution from statistics
  – Because our input series is normalized we can use the normal distribution as the data model
  – We divide the area under the normal distribution into ‘a’ equal-sized areas, where a is the alphabet size
  – Each such area is bounded by breakpoints
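The breakpoints can be computed rather than memorized. The sketch below is one possible way, using `NormalDist.inv_cdf` from Python's standard `statistics` module (Python 3.8 or later); the function name `breakpoints` is made up for illustration. The i-th breakpoint is the point below which a fraction i/a of the area under the standard normal curve lies.

```python
from statistics import NormalDist

def breakpoints(a):
    """Cut points that split the standard normal curve into a equal areas."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]

for a in (3, 4, 5):
    print(a, [round(b, 2) for b in breakpoints(a)])
```

An alphabet of size a always needs a - 1 breakpoints, and they are symmetric around 0.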
Symbol mapping - breakpoints
• Breakpoints for different alphabet sizes can be structured as a lookup table
• When a=3
  – Average values below -0.43 are replaced by ‘A’
  – Average values between -0.43 and 0.43 are replaced by ‘B’
  – Average values above 0.43 are replaced by ‘C’
• Using this table with a=3, SAX for our input series is ‘ACB’ (-0.9338 is below -0.43, 0.53135 is above 0.43, and 0.34767 lies between them)
      a=3    a=4    a=5
b1  -0.43  -0.67  -0.84
b2   0.43   0     -0.25
b3           0.67   0.25
b4                  0.84
SAX Computation – in pictures
[Figure: a normalized time series over time 0 to 120, its PAA steps, and the symbol assigned to each step, giving the SAX string baabccbc]
This slide is taken from Eamonn’s tutorial on SAX
Data Analysis using SAX
• A general approach is to convert time series into SAX
• Use SAX representations to train Markov models (details not here) on normal data
  – The model captures the probabilities of normal patterns
• The trained models are then used to test incoming data for known and unknown patterns
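Since the slides leave out the details, the following Python sketch only illustrates the general idea and should not be read as the method used in the lecture. It assumes a first-order Markov chain over SAX symbols with add-one smoothing; the names `train_markov` and `avg_log_likelihood` and the test strings are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_markov(sax_strings, alphabet):
    """Estimate symbol-to-symbol transition probabilities (add-one smoothing)."""
    counts = defaultdict(Counter)
    for s in sax_strings:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {nxt: (counts[prev][nxt] + 1) / total for nxt in alphabet}
    return model

def avg_log_likelihood(model, s):
    """Average log-probability per transition; lower means less like the training data."""
    steps = list(zip(s, s[1:]))
    return sum(log(model[prev][nxt]) for prev, nxt in steps) / len(steps)

model = train_markov(["baabccbc"], "abc")   # "normal" data
print(avg_log_likelihood(model, "baabccbc"))  # a pattern seen in training
print(avg_log_likelihood(model, "acacacac"))  # transitions never seen in training
```

A string whose score falls well below the scores of normal data can be flagged as an unknown pattern; where to set that threshold depends on the application.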
Visualization using SAX
• Given a SAX representation
  – count the frequencies of patterns (substrings) of the required length and
  – use them to color code a mosaic for visualizing the time series
• For example, given ‘baabccbc’ as the SAX representation
  – We calculate the frequencies of substrings of length 1 and represent them in a mosaic
• Visualizations for substrings of length>1 are possible (please refer to the SAX site)
[Figure: a 2x2 mosaic built in three steps for ‘baabccbc’]
Cell layout:      a b / c d
Mark frequencies: 2 3 / 3 0
Normalize:        0.67 1 / 1 0
Color code cells
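The counting and normalizing steps for substrings of length 1 can be sketched in Python as follows. The function name `mosaic` is hypothetical, and dividing by the largest frequency is an assumption inferred from the example values (2 out of 3 gives 0.67). Mapping the resulting numbers to colors is left to whatever plotting tool is used.

```python
from collections import Counter

def mosaic(sax, layout=(("a", "b"), ("c", "d"))):
    """Return the grid of normalized symbol frequencies for a SAX string."""
    counts = Counter(sax)        # frequency of each symbol (substrings of length 1)
    peak = max(counts.values())  # assumption: normalize by the largest frequency
    return [[round(counts[symbol] / peak, 2) for symbol in row] for row in layout]

print(mosaic("baabccbc"))  # [[0.67, 1.0], [1.0, 0.0]]
```

Each cell value between 0 and 1 can then be turned into a color intensity, so that two time series can be compared at a glance through their mosaics.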
Summary
• Structural representations help in understanding time series through
  – Data analysis + Visualization
• SAX is claimed to be a landmark representation of time series
  – Symbolic and therefore allows use of discrete data structures and their corresponding algorithms for analysis
  – Also helps with visualization