56
Categorical Time Series: Analysis, Modelling, Monitoring? Christian H. Weiß Department of Mathematics, Darmstadt University of Technology

Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series:Analysis, Modelling, Monitoring?

Christian H. Weiß

Department of Mathematics,

Darmstadt University of Technology

Page 2: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalTime Series

Motivation

Page 3: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Motivation

Log data of a web server:

Several types of categorical features (IP addresses, web ad-

dresses, state codes, etc.).

(→ ENBIS talk in 2006)

Possible applications:

• Measuring success of a web site,

• intrusion detection.

Christian H. Weiß — Darmstadt University of Technology

Page 4: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Motivation

Medical Diagnoses

of an examiner for a certain type of examination.

(→ ENBIS talk in 2009)

Possible applications:

• Compare diagnosis behavior of different examiners,

• compare examiner with a norm profile.

Christian H. Weiß — Darmstadt University of Technology

Page 5: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Motivation

Text:

Sequence of letters, words, word classes, grammatical enti-

ties, etc.

Possible applications:

• Language recognition,

• text recognition,

• part-of-speech tagging, etc.

Christian H. Weiß — Darmstadt University of Technology

Page 6: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Motivation

Genetic or protein sequences.

Possible applications:

• Structure analysis to understand functionality of seg-

ments of the sequence,

• recognition of similar sequences,

• recognition of common evolutionary roots,

• identification of sequences through comparison to con-

sensus model.

Christian H. Weiß — Darmstadt University of Technology

Page 7: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Motivation

SPC-related examples.

Xt = result of inspection of item, with

Xt = i for i = 1, . . . ,m iff item has nonconformity type i,

Xt = 0 iff conforming.

Mukhopadhyay (2008): m = 6 paint defects of ceiling fan

cover (‘poor covering’, ‘bubbles’, etc.).

Overall defect category = most predominant defect.

Ye et al. (2002) monitor network traffic data (284 different

types of audit events) for intrusion detection.

Christian H. Weiß — Darmstadt University of Technology

Page 8: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalRandom Variables

Properties

Page 9: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Categorical random variable:X takes one of finite number of unordered categories,

say b0, . . . , bm. E. g.,

• diagnoses b0, . . . , bm, or

• parts of speech b0, . . . , bm, or

• types of defects b0, . . . , bm, or . . .

To simplify notations:

Range of X is coded as V = {0, . . . ,m}.

Christian H. Weiß — Darmstadt University of Technology

Page 10: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

X categorical r.v. with range V = {0, . . . ,m}⇒ P (X = 0) = 1− ∑m

j=1P (X = j).

Abbreviate marginal probabilities: pi := P (X = i) ∈ (0; 1),

whole distribution determined by m parameters p1, . . . , pm.

1st problem: Number m of parameters might be very large,

in contrast to real-valued or count-data random variables,

no sparse parametric models available yet!

. . . except trivial cases:

one-point distribution, uniform distribution.

Christian H. Weiß — Darmstadt University of Technology

Page 11: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Typical for cardinal random variables:

quantify basic properties, e. g.,

• location via mean or median,

• dispersion via variance or quartiles.

But:

• no arithmetic operations for categorical range

⇒ no mean, variance, etc.

• unordered range ⇒ no quantiles.

Christian H. Weiß — Darmstadt University of Technology

Page 12: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

So how can we quantify

location and dispersion

of a categorical random variable?

Measures of location:

only mode of X in use, i. e.,

value i ∈ V such that pi ≥ pj for all j ∈ V.

Often not uniquely determined (e. g., uniform distribution).

Christian H. Weiß — Darmstadt University of Technology

Page 13: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Intuitive understanding of dispersion:

X shows large dispersion

High uncertainty about the outcome of X

⇒ Uncertainty of categorical random variable?

Christian H. Weiß — Darmstadt University of Technology

Page 14: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Two extreme cases:

Uniform distribution:

Maximal uncertainty about the outcome of X.

One-point distribution:

Perfect certainty about the outcome of X.

⇒ Hallmarks for definition of any measure of dispersion!

Contributions in literature (desirable properties, measures),

e. g., by Uschner (1987), Vogel & Kiesl (1999).

Christian H. Weiß — Darmstadt University of Technology

Page 15: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Common standardized measures of dispersion:

Gini index: νG(X) :=m+1

m(1 −

∑jp2j ).

Entropy: νE(X) := −1

ln (m+1)

∑jpj ln pj.

Chebycheff dispersion: νC(X) :=m+1

m(1 − maxj pj).

Christian H. Weiß — Darmstadt University of Technology

Page 16: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Important properties of these measures:

• continuous and symmetric functions of p,

• range [0; 1],

• maximum value 1 in case of uniform distribution,

• minimal value 0 in case of one-point distribution,

• inequality: m+1m (1 − minj pj) ≥ νG(X) ≥ νC(X).

Christian H. Weiß — Darmstadt University of Technology

Page 17: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Empirical measures of dispersion

based on sample X1, . . . , XT .

Binarization Y t ∈ {0,1}m+1 of Xt via Yt,i := δi,Xt.

Unbiased estimator for pi: p̂i(T ) := 1T · ∑T

t=1 Yt,i.

Abbreviate p := (p0, . . . , pm)⊤,

p̂(T ) :=(p̂0(T ), . . . , p̂m(T )

)⊤= 1

T · ∑t Y t,

and sk(p) :=∑j p

kj for k ∈ N.

Christian H. Weiß — Darmstadt University of Technology

Page 18: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Random Variables — Properties

Weiß (2011): Empirical Gini index defined by

ν̂G(X) := m+1m · T

T−1 ·(1− s2

(p̂(T )

)).

If computed from i.i.d. data, then

ν̂G(X) is exactly unbiased and asymptotically normally dis-

tributed with variance determined by

V[1− s2

(p̂(T )

)]= 4

T ·(s3(p)− s22(p)

)+ O(T−2).

Current research:

ν̂G(X) and ν̂E(X) for NDARMA processes (see below).

Christian H. Weiß — Darmstadt University of Technology

Page 19: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalTime Series

Terms & Notations

Page 20: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Terms & Notations

Categorical process:

(Xt)N with N = {1,2, . . .}, where each Xt takes one of finite

number of unordered categories.

Categorical time series:

Realizations (xt)t=1,...,T from (Xt)N.

(Xt)N said to be stationary

if joint distribution of (Xt, . . . , Xt+k)

independent of t for all k ∈ N0.

Christian H. Weiß — Darmstadt University of Technology

Page 21: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Terms & Notations

Notations for time-invariant probabilities:

If (Xt)N stationary:

• marginal probabilities pi := P (Xt = i) ∈ (0; 1).

p := (p0, . . . , pm)⊤, and

sk(p) :=∑j p

kj for k ∈ N; obviously s1(p) = 1.

• bivariate probabilities pij(k) := P (Xt = i,Xt−k = j),

conditional probabilities pi|j(k) := P (Xt = i | Xt−k = j).

Christian H. Weiß — Darmstadt University of Technology

Page 22: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalTime Series

Visual Analysis

Page 23: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

“Little existing work deals directly with categorical time se-

ries analysis, and much less deals with the visualization of

categorical time series.” (Ribler, 1997, p. 11)

Several visual tools for real-valued time series:

line plot, periodogram, correlogram, etc.

At least line plot simple and universal tool!

Christian H. Weiß — Darmstadt University of Technology

Page 24: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual AnalysisLinienplot (Bier.sta 1v*422c)

1 51 101 151 201 251 301 351 40140

60

80

100

120

140

160

180

200

220

240

Kon

sum

Christian H. Weiß — Darmstadt University of Technology

Page 25: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Visual tools for categorical time series:

only few proposals,

often from computer science and biology,

see survey by Weiß (2008).

In fact problematic: Analogue of line plot?

Lack of a natural order within purely categorical range

⇒ arrangement of range along ordinate

arbitrary and misleading.

Christian H. Weiß — Darmstadt University of Technology

Page 26: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual AnalysisLinienplot (Bovine.sta 6v*8419c)

1 11 21 31 41

g

a

c

t

Wer

t

or

Linienplot (Bovine.sta 6v*8419c)

1 11 21 31 41

t

g

a

c

Wer

t

Christian H. Weiß — Darmstadt University of Technology

Page 27: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Alternative (Keim & Kriegel, 1996): Map range onto set

of colors or symbols, plot x1, x2, . . . successively on space-

filling curve (line-by-line, column-by-column, Peano-Hilbert

curves, spiral, etc.).

However: These graphs perform poorly, characteristic fea-

tures of time series difficult to recognize, interpretation of

resulting plot is problematic, appearance depends heavily on

choice of underlying curve.

Christian H. Weiß — Darmstadt University of Technology

Page 28: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Genome of Bovine leukemia virus:

Christian H. Weiß — Darmstadt University of Technology

Page 29: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

So how to analyze categorical time series visually?

Proposal by Ribler (1997): Rate evolution graph.

Categorical process (Xt)N, range coded as V = {0, . . . ,m}.

Binarization Y t ∈ {0,1}m+1 with Yt,i = δi,Xt, i = 0, . . . ,m.

Define the cumulated sums Ct :=∑ts=1Y s, i. e.,

Ct,i = number of Xs, s = 1, . . . , t, equal to i.

Christian H. Weiß — Darmstadt University of Technology

Page 30: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Rate evolution graph of (Xt)N: (Ribler, 1997)

Multiple line plot of all component series Ct,i, i = 0, . . . ,m,

i. e., all Ct,i are plotted simultaneously into one chart.

Interpretation:

Slope of graphs is estimate for corresp. marginal probability.

If (Xt)N stationary and at most moderately serially depen-

dent, then graphs approximately linear in t

⇒ Simple visual tool for checking stationarity.

Christian H. Weiß — Darmstadt University of Technology

Page 31: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Shakespeare’s (1593) poem “Venus and Adonis”:/LQLHQSORW��YHQXVDQGDGRQLV�VWD���Y ����F�

�$�%�&�'�(�)�*�+�,�-�.�/�0�1�2�3�4�5�6�7�8�9�:�;�<�=

� ��� ��� ���� ���� ���� ���� ���� ���� ���� ���� ���� �����

���

���

���

���

���

���

���

Christian H. Weiß — Darmstadt University of Technology

Page 32: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Log data (2005) of Statistics server at Univ. of Würzburg:

Access to home directory of five members./LQLHQSORW��DFFHVVBORJB]LHO�VWD���Y ����F�

� ��� ��� ���� ���� ���� ���� ���� ���� ���� ���� ���� �����

���

���

���

���

���

���

���

Christian H. Weiß — Darmstadt University of Technology

Page 33: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Visual Analysis

Some further tools: (see Weiß (2008) for references)

• VizTree or IFS circle transformation for visualizing

string frequencies; (→ ENBIS talk in 2005)

• pattern histograms (e. g., based on runs or cycles);

• categorical control charts (also see below).

However: These tools are very specialized.

Universal instrument (like line plot), providing multiple types

of information at once, still missing.

Christian H. Weiß — Darmstadt University of Technology

Page 34: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalTime Series

Serial Dependence

Page 35: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Cardinal time series:

Convenient measure of serial dependence:

(Partial) Autocorrelation.

Categorical time series:

Measures of serial dependence?

Christian H. Weiß — Darmstadt University of Technology

Page 36: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Weiß & Göb (2008): same strategy as for dispersion mea-

sures, i. e., we start with

Extreme cases:

• Xt, Xt−k (stochastically) independent iff pij(k) = pi ·pj.

• Xt perfectly depends on Xt−k iff for every j = 0, . . . ,m:

conditional distribution of Xt, conditioned on Xt−k = j,

is a one-point distribution.

Christian H. Weiß — Darmstadt University of Technology

Page 37: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Weiß & Göb (2008) proposed, among others,

Goodman and Kruskal’s τ : τ(k) =

√√√√√√∑i,j

(pij(k)− pipj

)2pj

(1− s2(p)

) ,

Cramer’s v: v(k) =

√√√√√√ 1

m·∑i,j

(pij(k)− pipj

)2pipj

.

Some properties: τ(k), v(k) have range [0; 1] with

• Xt, Xt−k independent ⇔ τ(k) = v(k) = 0.

• Xt depends perfectly on Xt−k ⇔ τ(k) = v(k) = 1.

Christian H. Weiß — Darmstadt University of Technology

Page 38: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Cardinal case: Positive and negative autocorrelation.

⇒Concept of signed dependence: (Weiß & Göb, 2008)

• Xt, Xt−k perfectly positively dependent,if perfectly dependent and

if pi|i(k) = 1 for all i.

• Xt, Xt−k perfectly negatively dependentif perfectly dependent and

if pi|i(k) = 0 for all i.

Christian H. Weiß — Darmstadt University of Technology

Page 39: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Measures of signed dependence:

(Weiß, 2011; Weiß & Göb, 2008)

Cohen’s κ:

κ(k) =∑j pjj(k)− s2(p)

1− s2(p)with range [− s2(p)

1−s2(p); 1],

Modified κ:

κ∗(k) = 1m ·

(∑j pj|j(k) − 1

)with range [− 1

m; 1].

Some properties:

• Xt, Xt−k independent ⇒ κ(k) = κ∗(k) = 0.

• Xt, Xt−k perf. positively dep. ⇔ κ(k) = κ∗(k) = 1.

• Xt, Xt−k perf. negatively dep. ⇒ κ(k), κ∗(k) minimal.

Christian H. Weiß — Darmstadt University of Technology

Page 40: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Empirical measures of (signed) serial dependence

based on time series X1, . . . , XT .

Weiß (2011): empirical Cohen’s κ and modified κ:

κ̂(k) := 1+ 1T −

1− ∑j p̂jj(k, T )

1− s2(p̂(T )),

κ̂∗(k) := 1T + 1

m ·(∑

j

p̂jj(k, T )

p̂j(T )− 1

).

Christian H. Weiß — Darmstadt University of Technology

Page 41: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Empirical measures of (signed) serial dependence

If computed from i.i.d. data, then

κ̂(k), κ̂∗(k) are asymptotically normally distributed with

E[κ̂(k)

]= E

[κ̂∗(k)

]= 0 + O(T−2),

and variances given by

V[κ̂(k)

]= 1

T ·(1 − 1+2s3(p)−3s2(p)

(1−s2(p))2

)+ O(T−2);

V[κ̂∗(k)

]= 1

m·T + O(T−2).

Christian H. Weiß — Darmstadt University of Technology

Page 42: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Serial Dependence

Current and future research:

• κ̂(k), κ̂∗(k) and also τ̂(k), v̂(k) for NDARMA processes;

• goodness-of-fit tests for NDARMA processes.

Christian H. Weiß — Darmstadt University of Technology

Page 43: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalTime Series

Modelling

Page 44: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Modelling

Models for stationary categorical processes

with range V = {0, . . . ,m}:

• pth order Markov model: (m+1)p ·m parameters;

• variable length M. m. of Bühlmann & Wyner (1999):

more parsimonious, but model choice difficult;

• MTD(p) model of Raftery (1985):

still m(m+1)+ p − 1 parameters;

• NDARMA(p, q) models of Jacobs & Lewis (1983):

m+p+q parameters, also non-Markovian dependence.

Christian H. Weiß — Darmstadt University of Technology

Page 45: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Modelling

(Xt)Z, (ϵt)Z: categorical processes with range V = {0, . . . ,m};(ϵt)Z: i.i.d. with marginal distribution P (ϵt = j) = pj > 0,

ϵt independent of (Xs)s<t.

For φq > 0, with ϕp > 0 if p ≥ 1, let

Dt = (αt,1, . . . , αt,p, βt,0, . . . , βt,q) ∼ MULT (1;ϕ1, . . . , ϕp, φ0, . . . , φq)

be i.i.d. and independent of (ϵt)Z, (Xs)s<t.

(Xt)Z is NDARMA(p, q) process if

Xt = αt,1 ·Xt−1+ . . .+αt,p ·Xt−p + βt,0 · ϵt+ . . .+βt,q · ϵt−q.

Christian H. Weiß — Darmstadt University of Technology

Page 46: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Modelling

Some properties:

• marginal distribution P (Xt = j) = pj;

• only shows positive serial dependence, and

κ(k) = κ∗(k) = v(k) = τ(k);

• Yule-Walker-type equations

κ(k) =∑pj=1 ϕj·κ(|k−j|) +

∑q−ki=0 φi+k·r(i) for k ≥ 1,

where r(i) = 0 for i < 0, r(0) = φ0, and

r(i) =∑i−1j=max {0,i−p} ϕi−j·r(j) +

∑qj=0 δij·φj for i > 0.

Christian H. Weiß — Darmstadt University of Technology

Page 47: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Time Series — Modelling

Disadvantage of NDARMA models:Only describe positive dependence, i. e.,

categorical time series with long runs of symbols.

(e. g., genetic sequence vs. letters of text)

Issues for future research:

• Find simple and sparsely parametrized models

that also allow for negative dependence!

• Definition of trend or seasonality?

• And: How to remove such trend or seasonality?

Christian H. Weiß — Darmstadt University of Technology

Page 48: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

CategoricalProcesses

Monitoring

Page 49: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Processes — Monitoring

Only very few approaches for monitoring processes of

unordered and mutually exclusive categories.

Monitoring samples from an i.i.d. categorical process:

Duncan (1950), Marcucci (1985), Nelson (1987), Mukho-

padhyay (2008) plot Pearson’s χ2-statistic for goodness of

fit on a control chart.

Continuously monitoring cat. process (100% inspect.):

Weiß (2010) proposes

• two moving-average-type charts,

based either on Pearson statistic or Gini index,

• a (k, r)-runs chart.

Christian H. Weiß — Darmstadt University of Technology

Page 50: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Processes — Monitoring

(k, r)-run: finished after k successive observationsof either ‘1’ or ‘2’ or . . . or ‘r’, i. e., after observingone of (1, . . . ,1), (2, . . . ,2), . . . , (r, . . . , r) of length k each.

(k, r)th run lengths (Y (k,r)n )N determined as

Y(k,r)1 := No. obs. until first occurrence of k-tuple of ‘1’s or . . . ‘r’s,

Y(k,r)n := No. obs. after (n− 1)th occurrence of k-tuple of ‘1’s or . . . ‘r’s

until nth occurrence of k-tuple of ‘1’s or . . . ‘r’s, for n ≥ 2.

Example: m = 3 (i. e., V = {0,1,2,3})and (k, r) = (2,3), (fictive) time series:

1 2 0 0 0 3 0 2 2︸ ︷︷ ︸9

0 3 0 0 1 0 1 1︸ ︷︷ ︸8

1 1︸ ︷︷ ︸2

3 0 2 0 3 3︸ ︷︷ ︸6

2 3 . . .

Christian H. Weiß — Darmstadt University of Technology

Page 51: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Processes — Monitoring

If (Xt)N Markov chain, then

(Y (k,r)n )N i.i.d. process, range Nk := {k, k +1, . . .}.

Properties: (Chryssaphinou et al., 1994)

Let ck,r(z) :=r∑

i=1

(1− piz)(piz)k

1− (piz)k, then

E[Y (k,r)] =1

ck,r(1), V [Y (k,r)] =

1+ ck,r(1)− 2c′k,r(1)

c2k,r(1).

Christian H. Weiß — Darmstadt University of Technology

Page 52: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Categorical Processes — Monitoring

(k, r)-runs chart: (Weiß, 2010)

(Y (k,r)n )N plotted on chart with k ≤ LCL < UCL.

Exact ANE computation with Markov chain approach.

Issues for future research:

• charts based on different patterns,

e. g., cycles instead of runs;

• CUSUM or EWMA methods for categorical processes,

e. g., related to patterns or certain statistics; . . .

Christian H. Weiß — Darmstadt University of Technology

Page 53: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Conclusions

In a nutshell,

the field of categorical time series . . .

• is relevant for practice, and

• offers a lot of topics for future research,

in any of the disciplines

analysis, modelling and monitoring!

Christian H. Weiß — Darmstadt University of Technology

Page 54: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Thank You

for Your Interest!

Christian H. Weiß

Department of Mathematics

Darmstadt University of Technology

Page 55: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Literature

Box et al. (1994): Time series analysis – forecasting and control. 3rd edition, PrenticeHall, Inc., New Jersey.Bühlmann & Wyner (1999): Variable length Markov chains. Ann. Stat., 27, 480–513.Chryssaphinou et al. (1994): On the waiting time of appearance of given patterns. In:Godbole & Papastavridis (Eds.): Runs and Patterns in Probability, Kluwer AcademicPublishers, 231–241.Duncan (1950): A chi-square chart for controlling a set of percentages. IndustrialQuality Control, 7, 11–15.Jacobs & Lewis (1983): Stationary discrete autoregressive-moving average time seriesgenerated by mixtures. J. Time Series Anal., 4(1), 19–36.Keim & Kriegel (1996): Visualization techniques for mining large databases: A com-parison. IEEE Trans. Knowl. Data Eng., 8(6), 923–938.Marcucci (1985): Monitoring multinomial processes. J. Qual. Tech., 17(2), 86–91.Mukhopadhyay (2008): Multivariate attribute control chart using Mahalanobis D2 sta-tistic. J. Appl. Statist., 35(4), 421–429.Nelson (1987): A chi-square control chart for several proportions. J. Qual. Tech.,19(4), 229–231.Raftery (1985): A model for high-order Markov chains. J. Royal Stat. Soc., B, 47(3),528–539.

(. . . )

Christian H. Weiß — Darmstadt University of Technology

Page 56: Categorical Time Series: Analysis, Modelling, Monitoring?€¦ · Categorical Time Series — Motivation Medical Diagnoses of an examiner for a certain type of examination. (→ ENBIS

Literature

(. . . )

Ribler (1997): Visualizing categorical time series data with applications to computerand communications network traces. PhD thesis, Virginia State University.Uschner (1987): Streuungsmessung nominaler Merkmale mit Hilfe von Paarverglei-chen. Doctoral dissertation, University Erlangen-Nürnberg.Vogel & Kiesl (1999): Deskriptive und induktive Eigenschaften zweier Streuungsmaßefür nominale Merkmale. In: Vogel (Edt.): Arbeiten aus der Statistik, Univ. Bamberg.Weiß (2008): Visual analysis of categorical time series. Statist. Meth., 5(1), 56–71.Weiß (2010): Continuously monitoring categorical processes. Qual. Tech. Quant. Ma-nag., to appear.Weiß (2011): Empirical measures of signed serial dependence in categorical time se-ries. J. Statist. Comp. Simulation, 81(4), 411–429.Weiß & Göb (2008): Measuring serial dependence in categorical time series. Adv.Statist. Anal., 92(1), 71–89.Ye et al. (2002): Multivariate statistical analysis of audit trails for host-based intrusiondetection. IEEE Trans. Computer, 51(7), 810–820.

Christian H. Weiß — Darmstadt University of Technology