Implementation and Comparison of Preprocessing Methods for Biosurveillance Data

Thomas Lotze1, Sean Murphy2 and Galit Shmueli1,3

1 Applied Mathematics and Scientific Computation Program, University of Maryland, College Park, MD 20742
2 The Johns Hopkins University Applied Physics Laboratory
3 Department of Decision and Information Technologies and the Center for Health Information and Decision Systems, Robert H. Smith School of Business, University of Maryland, College Park, MD 20742

Abstract

Modern biosurveillance relies on multiple sources of both pre-diagnostic and diagnostic data, updated daily, to discover disease outbreaks. Intrinsic to this effort are two assumptions: (1) the data being analyzed contain early indicators of a disease outbreak and (2) the outbreaks to be detected are not known a priori. However, in addition to outbreak indicators, syndromic data streams include such factors as day-of-week effects, seasonal effects, autocorrelation, and global trends. These explainable factors obscure unexplained outbreak events and their presence in the data violates standard control chart assumptions. Monitoring tools such as Shewhart, Cumulative Sum, and Exponentially Weighted Moving Average control charts will alert largely based on these explainable factors instead of on outbreaks. The goal of this paper is twofold: first, to describe a set of tools for identifying explainable patterns such as temporal dependence, and second, to survey and examine several data preconditioning methods that significantly reduce these explainable factors, yielding data better suited for monitoring using the popular control charts.

1 Introduction: Monitoring via Control charts

Control charts are a tool for monitoring a process parameter (such as the process mean) by comparing daily parameter estimates to pre-determined thresholds called control limits. While other methods were proposed and sometimes used ([1], [2], [3]), control charts remain one of the most popular monitoring tools in traditional and modern biosurveillance. Such charts are applied in the widely-used monitoring systems BioSense ([4]), RODS ([5]), EARS ([6]), and ESSENCE ([7]). The classic Shewhart chart for monitoring the process mean relies on drawing a sample from the process at some frequency (e.g., weekly), and plotting the sample mean on the chart. Parameter limits are defined such that if the process remains in control, all (or nearly all) of the sample means will fall within the control limits. If a sample mean exceeds the control limits, it is assumed that the process mean has shifted, or in other words, the process has gone out of control; an alarm is triggered and an investigation follows to find its cause(s) ([8], [9]). Figure 1 shows an example of a one-sided Shewhart control chart, for detecting increases in the process mean. The dotted line indicates the control limit; red points show points exceeding the limit.

[Figure 1 approximately here]

To better understand how control charts can be applied to biosurveillance data, we discuss both the original and intended use of these charts. Statistical control charts, invented by Walter Shewhart, were first used in the 1920s to monitor factory outputs to discover abnormally high rates of product defects. An alarm indicated variance beyond the normal operating conditions and the presence of a "special cause", which was usually a faulty process that could then be corrected. Control charts have since been applied to a growing number of areas beyond industrial control, including extensive application to biomedical monitoring [10]. The three most commonly used control charts are: (1) Shewhart charts that monitor values of a sample statistic (e.g., the mean or standard deviation) or individual counts, (2) Cumulative Sum (CuSum) charts that monitor cumulative sums of sample deviations from a target process mean, and (3) exponentially weighted moving average (EWMA) charts that monitor an exponentially weighted average of current and past sample statistics. While all of the different charts monitor deviations from the target value of the process, each one is most effective at detecting a particular type of deviation from the target mean: a single spike, a shift in the process mean, or a gradual increase in the mean, respectively [11]. Underlying all of these methods is the assumption that the monitoring statistics are independent and identically distributed (iid), with the distribution generally assumed normal (although modifications can be made for statistics with known, non-normal distribution). While control charts are very effective for monitoring processes that meet the independence and known distribution assumptions, they are not robust when these assumptions are violated [12]. If the control chart assumptions do not hold, then they will fail to detect special cause variations and/or they will alert frequently even in the absence of special cause variations.
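To make the three chart types concrete, the following minimal Python sketch (an illustration only, not the implementation used in any of the systems cited above) computes one-sided alarms for a series assumed to be approximately iid normal with known in-control mean mu0 and standard deviation sigma; the threshold constants (k, slack, h, lam, L) are conventional textbook defaults, not values from this paper.

```python
import numpy as np

def shewhart_alarms(x, mu0, sigma, k=3.0):
    """One-sided Shewhart chart: alarm whenever a value exceeds mu0 + k*sigma."""
    x = np.asarray(x, dtype=float)
    return x > mu0 + k * sigma

def cusum_alarms(x, mu0, sigma, slack=0.5, h=4.0):
    """One-sided CuSum: accumulate standardized exceedances above a slack value,
    and alarm when the cumulative sum crosses the decision limit h."""
    s, alarms = 0.0, np.zeros(len(x), dtype=bool)
    for t, xt in enumerate(x):
        s = max(0.0, s + (xt - mu0) / sigma - slack)
        alarms[t] = s > h
    return alarms

def ewma_alarms(x, mu0, sigma, lam=0.2, L=3.0):
    """One-sided EWMA: exponentially weighted average of the observations,
    compared against an asymptotic upper control limit."""
    z, alarms = mu0, np.zeros(len(x), dtype=bool)
    limit = mu0 + L * sigma * np.sqrt(lam / (2.0 - lam))
    for t, xt in enumerate(x):
        z = lam * xt + (1.0 - lam) * z
        alarms[t] = z > limit
    return alarms
```

In the CuSum, the slack is usually set to half the standardized shift one wants to detect quickly, and in the EWMA a smaller lam makes the chart more sensitive to gradual increases than to single spikes.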

1.1 Challenges with Biosurveillance Data

Modern biosurveillance data come from multiple sources and in many forms. In general, syndromic data tend to be indirect measures of a disease (as opposed to more traditional diagnostic or clinical data). Examples are daily counts of emergency room visits, over-the-counter (OTC) or prescription medication sales, school absences, doctors' office visits, veterinary reports, or other data streams that could contain an indication of a disease outbreak. The purpose of biosurveillance is to monitor these time series to detect disease outbreaks. As in the industrial setting, control charts are used to monitor such data to detect "special causes" or abnormalities that are potentially indicative of an outbreak. However, currently collected biosurveillance data violate most of the assumptions required of data monitored by control charts. Thus, alarms triggered by control charts applied directly to raw syndromic data can arise not from actual outbreaks but due to explainable patterns in the data. Reports of very high false alarm rates from users of current syndromic systems lend evidence to this claim.

The explainable patterns are caused by factors unrelated to a disease. As an example, it is quite common for doctors' offices to have reduced staffing on weekends. Therefore, a syndromic data stream capturing daily doctor visits will see an explainable and predictable drop on Sundays and a corresponding increase on Monday. Many syndromic data streams demonstrate a marked day-of-week (DOW) effect, dropping or increasing in counts over the weekends with an early work-week resurgence or drop. Holidays and other external factors can cause a similar phenomenon. Even the release of Harry Potter books has a known effect on hospital admissions [13].

1.2 Effect of Assumption Violations on Control Charts

Although different data streams exhibit different behavior, a few explainable patterns exist that are common to many series and that clearly violate control chart assumptions. The presence of explainable components in syndromic data leads to a direct violation of the assumption of iid counts. These effects have also been seen in traditional surveillance systems ([14]) but are even more pronounced in syndromic surveillance: because it is collected more frequently, it is subject to greater autocorrelation; because its data sources are less direct, it is more influenced by components not related to disease outbreaks. Therefore, in order to increase the effectiveness of control chart monitoring for biosurveillance, we must account for these components in the raw data. Optimally, once these components are removed, we would be left with an iid series. Any shifts in the process would then be attributable to unexplained components; the likelihood that a shift in the process corresponds to a disease outbreak would then increase, thereby increasing the probability of detection while decreasing the probability of a false alarm.

An example of this can be seen by comparing a CuSum chart applied to the series of daily sales of allergy medication, before (Figure 2, top) and after (Figure 2, bottom) preconditioning.

[Figure 2 approximately here]

Here, the preconditioning significantly reduces the impact of seasonality. It is important to note that it is not simply the number of alerts that decreases due to the preconditioning (this could be achieved by simply raising the alerting threshold), but that the pattern of the alerts changes. Instead of multiple alerts and generally high levels for each season, we see that there are much tighter spikes after preconditioning. The overall level is much lower and the seasonal impact is reduced, indicating that other deviations (of more interest) would be more easily noticed in the preconditioned data. This can be achieved by removing explainable patterns from the data. We now describe the most prominent explainable patterns found in biosurveillance data, and their effect on chart assumptions.

The first explainable pattern is cyclic behavior including day-of-week and seasonality: the magnitudes of syndromic data often vary widely as a function of the day-of-week or time of year. If left uncorrected, this source of variation inflates the "normal variation" assumed by the control chart, thereby leading to overly conservative control limits; in biosurveillance, this can result in outbreaks being detected late or not at all. Alternatively, if the control limits are set low enough to detect true outbreaks, they will also be low enough to be set off by normal seasonal variance, resulting in much higher false alarm rates.

The second explainable pattern is daily autocorrelation. Because syndromic data typically arrive daily, there is almost always a strong degree of autocorrelation; sequential daily counts are not independent (because so much of the variance for sequential days comes from common causes). This can result in a higher level of false alerts in CuSum and EWMA charts, as one abnormal count is likely to be followed by more abnormal counts.

A third pattern is related to holidays. Holidays strongly impact the public's consumption of health care, drastically impacting (usually decreasing) many syndromic data streams. These outliers artificially increase the sample variance, causing the same issues as seasonal variation. In addition, many detection methods are sensitive to these "negative singularities" and will falsely report outbreaks when comparing new counts to past low holiday values [2].

Finally, particularly for syndromic series with very low counts (such as the number of unexpected deaths), the distribution of daily counts is far from normal, causing standard control limits to be incorrectly set. Applying control charts directly to raw syndromic data will either fail to detect actual outbreaks or will frequently alarm, despite the absence of actual outbreaks. While a high false alarm rate can be accounted for by artificially widening the control limits, this will then reduce the power to detect true deviations of interest. The reverse is also true. To remedy these problems we propose to remove explainable patterns from the raw data in an attempt to come closer to meeting the assumptions of control chart methods, and thus achieving a better ratio of detection power to false alarm rate.

If these explainable patterns are not removed, the control chart assumptions will not be met, and so the resulting alerts will be largely based on these patterns. Because they can have such a dramatic impact on the quality of the resulting control chart analysis, determining the most effective method for removing these patterns from a given dataset is very important. This paper presents quantitative and qualitative tools for comparing candidate methods' applicability to a syndromic data series and their effectiveness at removing the explainable effects.

The paper proceeds as follows: Section 2 describes the syndromic data used throughout the paper. Section 3 describes a set of statistical tools for detecting explainable patterns that "contaminate" the data, which make direct control chart monitoring ineffective. We then apply these tools to the syndromic data and illustrate their use. The paper's final goal is to survey and evaluate popular methods for data preconditioning. These methods are aimed at removing explainable patterns from the raw data, thereby creating "conditioned data" that can be monitored more effectively using control charts. Section 4 describes different preconditioning methods and evaluates their usefulness by applying them to syndromic data. A comparison is given in Section 5. We conclude and describe future directions in Section 6.

2 Data Description

2.1 Over-the-counter (OTC) medication sales

The first dataset, from a grocery chain in the Pittsburgh area, includes daily sales for seven categories of medications, from August 1999 to January 2001 [15]. Average daily counts vary widely across the different categories, with varying degrees of weekly and annual dependence. The seasonal factor dominates the time series, and would therefore be the major cause for alerts in standard control charts. In the following we use one of the series (throat lozenges) to illustrate the behavior of OTC series. Two others appear in the Appendix (asthma and allergy medications).

2.2 Chief complaints at emergency departments

The second dataset, from ESSENCE (Electronic Surveillance System for the Early Notification of Community-Based Epidemics), is composed of 35 time series representing daily counts of ICD-9 codes. ICD-9 is the 9th revision of the International Statistical Classification of Diseases and Related Health Problems, published by the WHO and used worldwide. These ICD-9 codes are generated by patient arrivals at emergency departments (ED) in an unspecified metropolitan region from Feb-28-1994 to Dec-30-1997. The 35 series were then grouped into 13, using the CDC's syndrome groupings. These syndrome groups show a diversity in the level of daily counts and in weekly and annual dependence across the different syndrome subgroups. The counts for the 38 holidays contained in the dataset were eliminated. In the following we use one series (gastrointestinal (GI)-related ED visits); two additional ED visit series (Respiratory and Unexplained Deaths) are displayed in the Appendix.

3 Tools for Detecting Explainable Patterns

There are many tools available for detecting explainable patterns in the data. Although some of these (notably domain knowledge and graph analysis) will require human intervention, this analysis need not be carried out every day; once a series has been analyzed, the preprocessing method can be chosen and applied continuously with occasional checks. Although it is tempting to completely automate the analysis and preprocessing of syndromic data series, human intervention is still a very valuable tool for finding and removing explainable patterns in the data. We now describe each of three methods for detecting explainable patterns.

3.1 Domain Knowledge

The first method for determining temporal patterns in syndromic data is to use domain knowledge. From public health and medical sources we can learn whether there exist day-of-week effects in counts of emergency department visits and doctors' office visits. Since many hospitals dramatically reduce staffing on weekends ([16], [17], [18], [19], [20], [21], [22]), counts are generally much lower on weekends. We also know that seasonal trends might be present in ED visits for reasons such as flu season. Marketing knowledge can tell us that grocery shopping is more popular on weekends than on weekdays. And for both types, holidays always have exceedingly low counts.

3.2 Graphs and charts

The second step is to use statistical summaries and graphs to quantify such effects, and to identify others. Some useful statistics and graphs are:

Time plots of each series with zoomed-in views (for detecting local effects such as DOW).

Moving average charts for detecting overall trends, with narrow windows for local trends and wider ones for global trends. A window of width 7 suppresses DOW effects, whereas width 28 suppresses (nearly) monthly effects.

Moving standard deviation charts for determining whether the seasonal variation is additive or multiplicative (that is, whether higher values also lead to a greater standard deviation, as is normally the case in count data). This can suggest using a logarithmic transform or a multiplicative seasonality model.

Autocorrelograms - plots of autocorrelations and partial autocorrelations at different lags for highlighting periodic effects and temporal dependence. A lag-1 correlation indicates daily autocorrelation, correlations at lag 7 and its multiples indicate a day-of-week effect, and a lag-365 correlation indicates a yearly pattern.

Normal probability plots and histograms for assessing normality. Skewness and kurtosis statistics are also useful for this purpose.

Two additional potentially useful statistics and graphs are partial autocorrelations at lag 365 and spectral plots. However, we do not use partial autocorrelograms because partial autocorrelation values for a 365-day lag are very sensitive to missing values, and are not reliable when holidays have been removed. We also do not use spectral plots because they tend to mask weekly seasonality when there is a strong yearly seasonality (the size of the weekly peak can be so small as to be barely distinguishable).
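The following sketch (an illustration only; the paper prescribes no particular software) shows one way to produce several of these graphs for a daily series stored as a pandas Series with a DatetimeIndex, using matplotlib, scipy, and statsmodels; the zoom window dates are arbitrary placeholders.

```python
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def diagnostic_plots(series, zoom=("2000-01-01", "2000-01-31")):
    """Draw the screening graphs of Section 3.2 for a daily pandas Series:
    full time plot, one-month zoom, 7- and 28-day moving averages,
    ACF/PACF, and a normal probability plot."""
    fig, ax = plt.subplots(2, 3, figsize=(15, 8))
    series.plot(ax=ax[0, 0], title="Time plot")
    series.loc[zoom[0]:zoom[1]].plot(ax=ax[0, 1], title="One-month zoom (DOW effect)")
    series.rolling(7, center=True).mean().plot(ax=ax[0, 2], label="7-day MA")
    series.rolling(28, center=True).mean().plot(ax=ax[0, 2], label="28-day MA",
                                                 title="Moving averages")
    ax[0, 2].legend()
    plot_acf(series, lags=60, ax=ax[1, 0])     # lags 7, 14, 21 flag a DOW effect
    plot_pacf(series, lags=60, ax=ax[1, 1])
    stats.probplot(series, plot=ax[1, 2])      # normal probability plot
    fig.tight_layout()
    return fig
```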

3.3 Summary statistics

The following statistics are useful for detecting patterns and for evaluating and comparing preprocessing methods (a computational sketch follows the list):

mean: the sample mean

stdev: the sample standard deviation

weekendMean: the mean from weekends only; deviations from the global mean are indicative of the magnitude of the weekday-weekend effect

percentInMin: the percent of values which are at the minimum value for the series; higher values indicate that this is a "low count" series

pacfWeek: the partial autocorrelation function (pacf) coefficient at a 7-day lag; greater absolute values indicate a stronger day-of-week effect

acfWeek: the autocorrelation function (acf) coefficient at a 7-day lag; greater absolute values indicate a stronger day-of-week effect. If there is no correspondingly large pacf, it may be indicative of short-term (shorter than 7-day) clustering effects in the series

acfYear: the maximum acf coefficient at a 364- or 365-day lag; greater absolute values indicate stronger yearly seasonality (364 is preferable because it falls on the same day-of-week and often shows a greater degree of correlation than 365)

daysHighPacf: the number of days (lags) at which the series has a "significantly high" (greater than 4/√n) pacf. This measure is often used for determining the autocorrelation order in an ARIMA model, and here can indicate the length of significant local (not yearly) seasonality effects

skewness: the skewness of the series; deviations from 0 indicate non-normality, with positive deviations indicating a positive skew and negative deviations indicating negative skew

excessKurtosis: the kurtosis minus 3; deviations from 0 indicate non-normality, with positive values indicating a more sharply peaked, heavier-tailed distribution and negative values indicating a flatter, lighter-tailed one
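One possible implementation of these summary statistics is sketched below. It assumes the series is a pandas Series indexed by calendar date; the function and variable names are ours, not the paper's.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.tsa.stattools import acf, pacf

def summary_stats(series: pd.Series, max_pacf_lag: int = 60) -> dict:
    """Compute the screening statistics of Section 3.3 for a daily series
    with a pandas DatetimeIndex."""
    y = series.astype(float)
    n = len(y)
    r = acf(y, nlags=min(365, n - 1), fft=True)
    p = pacf(y, nlags=min(max_pacf_lag, n // 2 - 1))
    return {
        "mean": y.mean(),
        "stdev": y.std(),
        "weekendMean": y[y.index.dayofweek >= 5].mean(),   # Saturday=5, Sunday=6
        "percentInMin": (y == y.min()).mean(),             # fraction at the series minimum
        "pacfWeek": p[7],
        "acfWeek": r[7],
        "acfYear": max(r[364], r[365]) if n > 366 else np.nan,
        "daysHighPacf": int((p[1:] > 4 / np.sqrt(n)).sum()),
        "skewness": stats.skew(y),
        "excessKurtosis": stats.kurtosis(y),               # Fisher definition: normal -> 0
    }
```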

3.4 Applying graphs and summaries to data

A close examination of the characteristic plots and summary statistics can be used to detect different explainable patterns in the data. The top row in each of Figures 3 and 4 presents several of these plots for sales of OTC throat lozenges (Figure 3) and the GI-related ED visits series (Figure 4). Corresponding summary statistics are given in the left column of Tables 1 and 2. See the Appendix for plots and statistics for four additional series.

Seasonality

The degree of seasonality in the data can often be determined by a visual inspection of the time series at different temporal scales. Autocorrelograms (ACF and PACF plots) and spectral plots are two additional plots useful in uncovering seasonal patterns. However, one must be careful in using spectral plots, as large peaks can mask seasonal variance at other scales.

Most of the series in our datasets exhibit a pronounced seasonal pattern with peaks during the winter months. This can be seen for two of the series in the top left panels in Figures 3 and 4, where throat lozenge sales exhibit a 3-month cycle and GI-related ED visits exhibit a yearly pattern (see the Appendix for similar plots for four more series). This can also be seen in the relative values of the 365-day ACF (sixth column), where a large value indicates a strong yearly seasonal component.

Day-of-week (DOW) effect

To detect day-of-week effects we first zoom in to a shorter one-month period of the data. The second column (top row) in Figure 3 displays this for throat lozenge sales and in Figure 4 for GI-related ED visits. Weekends are highlighted in these zoom plots. Other useful tools are the ACF and PACF plots (columns 5-7 in these Figures). Both series exhibit a strong DOW effect: OTC sales have peaks on weekends (due to the general trend of high volume purchasing on weekends), and ED visits drop on weekends. The ACF plot for ED visits and some of the OTC medications (see Appendix for plots of four additional series) shows high autocorrelation at lags 7, 14, 21, etc., indicative of a DOW effect. The DOW effect is even present in unexplained deaths!

Autocorrelation

Autocorrelograms (graphs of the estimated autocorrelation as a function of the lag) are useful for studying the correlation of the data series with itself at various lags, and can indicate lags that play an important role. As mentioned above, the autocorrelograms (columns 5-6 in Figures 3 and 4 and the Appendix) show high autocorrelation at lags 7, 14, 21, etc. for most of the series, indicative of a DOW effect. This can also be seen in the higher values for the acfWeek and pacfWeek statistics (left column in Tables 1 and 2 and the corresponding tables in the Appendix). In addition, examining longer lags indicates bi-annual seasonality for throat lozenge sales, while for GI-related ED counts the weekly pattern repeats and is stronger than a yearly pattern.

Normality

To evaluate how closely the data follow a normal distribution, we use histograms and a normal probability plot and also examine summary statistics such as skewness and excess kurtosis. From the normal probability plots in column 4 of Figures 3 and 4 we see that both series exhibit significant deviations from normality. Throat lozenges (and other OTC sales) are skewed to the right; they also seem to be slightly more tightly clustered than normal data. GI-related (and other) ED counts appear to be bimodal (one peak for weekends, one for weekdays). The deviation from normality is also seen in the values for skewness and kurtosis in the summary tables (Tables 1 and 2).

[Figure 3 approximately here]

[Figure 4 approximately here]

[Table 1 approximately here]

[Table 2 approximately here]

4 Methods for Preconditioning

Several methods exist for removing explainable factors from the data. These include model-based methods, which assume a particular model and estimate the parameters in that model, and data-driven methods, which fit the data non-parametrically rather than attempting to model the causes. The methods can also differ in their global versus local nature.

4.1 Linear regression models

Regression models are a popular method for capturing recurring patterns such as day-of-week, seasonality, and trends [23]. The classic assumption is that these patterns do not change over time, and therefore the entire data series can be used to estimate them. To model the different patterns, suitable predictors are created:

Day-of-week effects can be captured by six dummy variables, each representing one day of the week (relative to the remaining baseline day). If there is only a weekday/weekend effect, a single dummy variable can be used.

A global linear trend can be modeled using a predictor t that is a running index (t = 1, 2, 3, . . .). Other types of trends such as exponential and quadratic trends can also be captured via a linear model by transforming the response and/or index predictor, or by adding transformations of the index predictor (such as adding t^2 to capture a quadratic trend).

Seasonality can be modeled by a sinusoidal trend. The Centers for Disease Control and Prevention (CDC) use a regression model that includes sine and cosine functions to capture a cyclical trend of mortality rates due to influenza [24, 25], although these terms will not be significant in series without pronounced seasonality. Another regression-based method for dealing with seasonality is to fit local regression models, using past data from the same time of year ([26]). Note that explicit modeling of seasonal variation assumes that the seasonal pattern remains constant from year to year.

Holidays can be captured by constructing a dummy variable for holidays or by replacing holiday values with missing values.

From our experience as well as other reports in the literature [27, 28], we find that seasonality effects tend to be multiplicative rather than additive with respect to the response variable. Thus, a linear model where the response is transformed into a natural log (log(y)) is often appropriate. For our data series, we fit a linear regression and a multiplicative regression, and found that the multiplicative version better captured the day-of-week effect. Both are reported below.

Currently, several biosurveillance systems implement some variation of regression preconditioning. ESSENCE uses a linear regression model that includes day-of-week, holiday, and post-holiday indicators ([7]), and BioSense uses a Poisson regression with predictors that include a linear trend, sine and cosine effects for seasonality, month indicators, DOW indicators, and holiday and day-after-holiday indicators ([29]).

Regression models can also be used to integrate external information that can assist in removing explainable patterns. For example, in our data the seasonal pattern was highly correlated with temperature. Figure 5, which shows counts of throat lozenge sales (in black) together with the average daily temperature (in red), demonstrates this relationship. There is a strong negative relationship between temperature and sales: as the weather gets colder, more cough remedy drugs are sold. However, the causality of temperature is unclear and we therefore treat it only as a proxy. One option is to create alternative temperature-related predictors for capturing yearly seasonality. An example is a Date function, which is a linear function rising to 1 in winter and decreasing to -1 in summer.

[Figure 5 approximately here]

The regression model for our data includes daily dummy variables (Monday, Tuesday, Thursday, Friday, Saturday, Sunday) to account for the DOW effect, a holiday indicator (Holiday), an index variable (index) to capture a linear trend, and daily average temperatures (Tavg) and monthly dummy variables (Jan, Feb, Mar, Apr, May, Jul, Aug, Sep, Oct, Nov, Dec) to remove seasonality. Figure 6 shows an example of the resulting time series of residuals (actual value - predicted value) for the sales of throat lozenges. This series was later used as input into standard control charts.

[Figure 6 approximately here]

The main advantage of regression modeling is that it provides a general yet powerful method to remove variation due to factors unrelated to outbreaks. It is relatively effective at removing both yearly seasonality and day-of-week variation. However, it requires a fairly large amount of data for obtaining accurate estimates, especially for long-term patterns.
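A simplified sketch of this kind of regression preconditioning is given below. It assumes a daily pandas Series of counts with a DatetimeIndex and a list of holiday dates, omits the temperature predictor (Tavg) used in the model above, and uses our own variable names rather than the paper's.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def regression_residuals(counts: pd.Series, holidays, log_counts: bool = True) -> pd.Series:
    """Fit a regression with day-of-week dummies, month dummies, a holiday
    indicator, and a linear trend; return the residuals (the preconditioned series)."""
    df = pd.DataFrame({"y": counts.astype(float)})
    df["dow"] = counts.index.dayofweek.astype(str)      # day-of-week factor
    df["month"] = counts.index.month.astype(str)        # coarse yearly seasonality
    df["holiday"] = counts.index.isin(holidays).astype(int)
    df["t"] = np.arange(len(df))                        # linear trend
    if log_counts:                                      # multiplicative (log) version
        df["y"] = np.log(df["y"] + 1)
    fit = smf.ols("y ~ C(dow) + C(month) + holiday + t", data=df).fit()
    return fit.resid
```

The residuals from such a fit are what would then be fed into a Shewhart, CuSum, or EWMA chart; with log_counts=True the residuals are on the log scale, corresponding to the multiplicative model.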

4.2 Ratio-to-moving-average indexes

For cyclical data, with virtually any cycle length (weekly, monthly, yearly, etc.), we can compute seasonal indexes and use them to deseasonalize the data. Seasonal-adjustment methods are very popular in business, and government agencies such as the Bureau of Labor Statistics and the Census Bureau use such methods to report figures such as monthly unemployment rates.

A simple method to compute indexes is the ratio-to-moving-average method. This is also the basis for the X-11 and X-12 systems used by the Census Bureau [30]. The idea is to estimate and remove any linear trend from the data, and then to estimate the seasonal component in the de-trended data. To compute day-of-week seasonal indexes the following algorithm is used:

1. Estimating the trend: For each day, compute the moving average with a 7-day window centered around that day. For example, for a Tuesday and a window of seven days, we compute the average of the 3 previous days (Sat, Sun, Mon), the value on Tuesday itself, and the 3 following days (Wed, Thur, Fri).

2. Removing the trend: Divide the daily value by its corresponding moving average. These are the raw seasonals.

3. Estimating seasonality: Compute the average of all raw seasonals for the same day (e.g., the raw seasonals for each of the Tuesdays are averaged across the entire period).

4. Scale the averages so that they average to 1 (equivalently, the seven day-of-week indexes sum to 7).

This algorithm gives multiplicative indexes, where each index gives the counts on that day as a proportion of the weekly average. For example, an index of 1.2 for Tuesday would mean that Tuesdays have counts 20% higher than (i.e., 120% of) the average weekly count. A similar process can be followed to compute monthly indexes or indexes for any other fixed period.
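A sketch of the day-of-week version of this algorithm for a daily pandas Series is shown below (illustrative only; the function names are ours).

```python
import pandas as pd

def dow_indexes(counts: pd.Series) -> pd.Series:
    """Ratio-to-moving-average day-of-week indexes (steps 1-4 above)."""
    trend = counts.rolling(window=7, center=True).mean()        # step 1: centered 7-day MA
    raw_seasonals = counts / trend                               # step 2: de-trend
    idx = raw_seasonals.groupby(counts.index.dayofweek).mean()   # step 3: average per weekday
    return idx / idx.mean()                                      # step 4: rescale to average 1

def deseasonalize(counts: pd.Series) -> pd.Series:
    """Divide each observation by its day-of-week index."""
    idx = dow_indexes(counts)
    return counts / idx.reindex(counts.index.dayofweek).values
```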

It is possible to compute and remove multiple seasonal cycles with different periods, such as a weekly cycle and an annual cycle. For the syndromic data we tried both a 7-day seasonal adjustment (DOW effect) and a yearly adjustment (done using approximate monthly seasons: 365-day period, 12 seasons), as well as combinations of the two (first weekly adjustments, then yearly, and vice versa). While the 7-day procedure is quite effective at removing weekly patterns, it obviously does not remove yearly seasonality. The yearly deseasonalization technique is somewhat effective at removing day-of-week and seasonal patterns, but less so than the other methods described in this section. Seasonal adjustment via ratio-to-moving-average indexes should only be performed on the raw count data; if this method is applied to normalized data centered around zero (such as regression residuals), it frequently generates abnormally high results due to division by an average very close to zero, thereby creating highly unrepresentative results.

4.3 Differencing

Differencing is the operation of subtracting a previous value from the current one. The order of differencing gives the lag between the two values: an order-1 differencing means that we take differences between consecutive days (y_t − y_{t−1}), whereas an order-7 differencing means subtracting the value of the same day last week (y_t − y_{t−7}). This is a popular method in time series analysis, where the goal is to bring a non-stationary time series closer to stationarity [31]. Differencing has an effect both on removing linear trends and on removing recurring cyclic components. In the context of syndromic data, the only instance where differencing has been suggested is in [32], which shows that a 7-day differencing can be effective at normalizing syndromic data.

In our data, the DOW effect is relatively stable throughout the entire period. We therefore use an order-7 difference. The preconditioned time series is simply the difference between the value on the current day and the value 7 days ago. In addition, we accounted for holidays by removing the values on holidays, and then obtaining differenced values for the 7th day following a holiday by differencing at lag 14 (i.e., subtracting the value from two weeks prior). This improves the method by removing outliers from known (holiday) causes.
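A sketch of this holiday-adjusted lag-7 differencing, assuming a daily pandas Series and a list of holiday dates (illustrative only; not the paper's code), follows.

```python
import numpy as np
import pandas as pd

def lag7_difference(counts: pd.Series, holidays) -> pd.Series:
    """Order-7 differencing with the holiday correction described above:
    holiday values are set to missing, and where the value 7 days back was a
    holiday, the difference is taken at lag 14 instead."""
    y = counts.astype(float).copy()
    y[y.index.isin(holidays)] = np.nan   # drop holiday counts
    diff7 = y - y.shift(7)
    diff14 = y - y.shift(14)
    return diff7.fillna(diff14)          # fall back to lag 14 where lag 7 hit a holiday
```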

The main advantage of differencing is that it is easy and computationally cheap to perform, and so provides an excellent basis for comparison. It is very effective at removing both weekly and monthly patterns but can result in abnormally high results after abnormally low points in the original data (called "negative singularities" by [2]). Another side-effect of seven-day differencing is that it creates strong weekly partial autocorrelation effects and can increase the variance in the data if there is little or no existing DOW effect.

4.4 Holt-Winter’s exponential smoothing

The Holt-Winters’ exponential smoothing techniqueis a form of smoothing in which a time series at timet is assumed to consist of three components: a levelterm Lt, a trend term Tt, and a seasonality term St.The k-step ahead forecast is given by

yt+k = (Lt + kTt)St+k−M , (1)

where M is the number of seasons in a cycle (e.g., fora weekly periodicity M = 7). The three componentsLt, Tt, and St are updated, as new data arrive, asfollows:

Lt = αYt

St−m+ (1− α)(Lt−1 + Tt−1) (2)

Tt = β(Lt − Lt−1) + (1− β)Tt−1 (3)

St = γYt

Lt+ (1− γ)(St−M ), (4)

where α, β, and γ are smoothing constants that takevalues in (0, 1). Each component is updated at everytime step, based on the actual value at time t.

For our data we use the multiplicative seasonality version because the seasonal effects in our syndromic time series are generally proportional to the level L_t. An additive formulation is also available [33, 34].
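The update equations above translate directly into code. The sketch below is ours, not the authors' implementation; it uses naive start-up values and omits the outlier safeguard discussed below, and returns the one-step-ahead forecast errors that would be monitored.

```python
import numpy as np

def holt_winters_errors(y, alpha=0.4, beta=0.0, gamma=0.15, M=7):
    """One-step-ahead forecast errors from multiplicative-seasonal Holt-Winters
    smoothing (equations (1)-(4)), with start-up values taken from the first cycle:
    level = mean of the first M observations, trend = 0, seasonal indexes =
    first-cycle values divided by that level."""
    y = np.asarray(y, dtype=float)
    L = y[:M].mean()
    T = 0.0
    S = list(y[:M] / L)                     # one seasonal index per position in the cycle
    errors = np.full(len(y), np.nan)
    for t in range(M, len(y)):
        s = S[t % M]                        # seasonal index from one cycle ago
        errors[t] = y[t] - (L + T) * s      # eq. (1) with k = 1
        L_new = alpha * y[t] / s + (1 - alpha) * (L + T)    # eq. (2)
        T = beta * (L_new - L) + (1 - beta) * T             # eq. (3)
        S[t % M] = gamma * y[t] / L_new + (1 - gamma) * s   # eq. (4)
        L = L_new
    return errors
```

With the weekly cycle (M = 7), the forecast errors, rather than the raw counts, become the input series for the control charts of Section 1.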

The principal advantage of this technique is that it is data-driven and highly automatable. The user need only specify the cycle of the seasonal pattern (e.g., weekly) and the three smoothing parameters. The choice of smoothing parameters depends on the nature of the data and the degree to which the patterns are local versus global. A study by [28] considered a variety of city-level time series, both with and without seasonal effects. They recommend using the smoothing coefficients α = 0.4, β = 0, and γ = 0.15 for seasonal series and α = 0.1, β = 0, γ = 0.15 for non-seasonal series. Following this guideline, we used the first setting for each series that exhibited a one-year autocorrelation higher than 0.15, and the second setting otherwise. In addition, we applied the modification suggested in [28], which does not update the parameters if the actual value deviates from the prediction by more than 50% (to avoid the influence of outliers).

The Holt-Winters method is very effective at capturing yearly seasonality and weekly patterns. Although it is not straightforward to tune the smoothing parameters, the settings provided here proved generally effective for our syndromic data. One point of caution should be made. As in any method that produces one-step-ahead predictions, a gradually increasing outbreak is likely to get incorporated into the background noise, thereby masking the outbreak signal. One solution is to generate and monitor k-day-ahead predictions (k > 1) in addition to one-day-ahead predictions.

5 Method Comparison

We compare the effectiveness of the different methods by examining the preconditioned series for explainable patterns and evaluating their conformity to the iid normal assumption. In particular, we evaluate the degree of seasonality (weekly and yearly) in the data by examining the average autocorrelation and partial autocorrelation values at one-week and one-year lags. Normality is evaluated by examining the skewness and excess kurtosis, as well as histograms and normal probability plots. While numerous methods and combinations and variations of methods were examined, results from only five methods are presented. These five were chosen based on their ability to reduce explainable effects (and the resulting false alarms in standard control charts). The methods are: residuals from a regression on the counts, residuals from a linear regression on log(counts), 7-day differencing, 7-day differencing modified for holidays, and forecast errors from Holt-Winters exponential smoothing. Figure 3 compares the preconditioned lozenge sales series across the five methods using the graphs proposed in Section 3 as well as a CuSum chart. A similar comparison is given for the GI-related ED visits in Figure 4. Summary statistics and charts for these series are in Tables 1 and 2. Statistics and figures for additional preconditioned OTC series and ED series are given in the Appendix.

Overall, all five methods greatly improved the data quality for syndromic surveillance via control charts, and therefore each method seems suitable as a first preprocessing step. In particular, we find that differencing, regression, and Holt-Winters smoothing each significantly reduces seasonal patterns; by examining the CuSum charts, we also see that this results in a narrower, more sharply peaked set of alerts. These graphs also show the effect of preconditioning on the autocorrelation and on normality. Autocorrelations at lags of seven (and its multiples) are greatly reduced, as are very-large-lag autocorrelations. Holt-Winters smoothing appears to best remove the daily autocorrelation, while regression modeling retains some of this autocorrelation. Multiplicative regression models are better than additive models, especially when seasonality is present. And finally, seven-day differencing appears to perform reasonably well, except for creating large negative lag-7 autocorrelations and partial autocorrelations. Including the holiday correction does remove these negative partial autocorrelations, indicating that holidays do require special treatment in the preprocessing step. When moving from high-count to low-count series (e.g., UnexplainedDeath), we find that Holt-Winters smoothing becomes inferior to the other methods, and is not able to capture the cyclic patterns well (see Appendix).

Although seasonality can be relatively well accounted for, there are still difficulties in creating normally distributed residuals for some data series; mainly, this seems to be due to a strong, centrally peaked distribution, as indicated in the histograms and the high excess kurtosis. This is especially true for low-count series. However, compared to the raw data, there is definite improvement in eliminating multiple modalities and in getting closer to normality.

6 Conclusions, Limitations, and Future Work

This paper emphasizes the need to account for explainable patterns in biosurveillance data before applying the widely-used control charts. We present several well-known methods for removing such effects and compare their usefulness. Although we focus here on data that is used in temporal monitoring using control charts, such preprocessing can also be helpful in spatial and spatio-temporal monitoring, when an underlying iid assumption exists, such as in the widely-used spatio-temporal scan statistic ([35]).

One future direction is to create an automated application that uses these preconditioning methods to explore and categorize each data series, providing recommendations and rationales for various methods to the end user. This automated expert system could help practitioners determine the methods that would best precondition their data, while allowing them to include domain knowledge. Such a system could perform this function by analyzing the statistics above, selecting appropriate preconditioning methods, and then displaying graphical plots to illustrate the reasons for each suggested method. The user would then be able to assess which patterns are reasonable in a particular dataset and, based on the system's output, to choose the preferred preconditioning operation(s).

This paper provides a general framework for data preconditioning, but there are several improvements that can follow. First, the tools and methods described here are all aimed at univariate series, where each syndromic series is considered separately. Future work should consider the related multivariate nature of the series, both in choosing preprocessing methods and in the analysis of the preprocessed results. Second, in this work we followed the standard CDC grouping for syndromes to arrive at the 13 ED series and the grocery chain's grouping of OTC remedies. However, it might be useful to examine alternative groupings, for example, grouping the OTC sales data into two categories (headache and cough) or removing all category 2 counts (described by the CDC as "codes that might normally be placed in the syndrome group, but daily volume could overwhelm or otherwise detract from the signal generated") from the ED groups. A third issue is data quality. For our data, we removed one series from consideration because it contained a change in data collection midway through the period being examined, when additional products were added to the category. Measurement issues such as this, which also include sudden drops in products or delays in provided information, are common in biosurveillance data; it is possible that some of these methods could be used to detect or correct such issues.

In our data, we assumed that there were no known outbreaks. However, it is obvious that the data contain influenza seasons, which affect both ED visits and OTC sales. The problem of unlabeled data (in the sense that we do not know exactly when a disease outbreak is present and when there is none) is a serious one for both modeling and performance evaluation. Our suggestion is to use a time period that is assumed to be outbreak-free for the preconditioning step, or at least to suppress suspicious periods from affecting the estimation. A related issue that arises in monitoring daily data is that of gradual outbreaks. Autocorrelation between days (in particular, 1-day autocorrelation) should also be examined and controlled for, in order to approach the statistical independence assumption required for standard control charts. However, a gradual outbreak will also increase the autocorrelation between days (as a rising number of people will show symptoms). It is therefore important to remember the danger of embedding the outbreak signal into the background data. As proposed earlier, one solution is to examine predictions that are farther into the future, and also to use a "guard band" that avoids the use of the last few days in the detection algorithm (similar to the implementation by [36]).

Several additional issues exist when removing explainable effects. Seasonal differences in variance are common in series with seasonal effects; this fluctuation in variance also causes the series to deviate from an iid sample, producing more alerts in high-variance periods than in low-variance periods. Low-count or sparse data are also not explicitly examined by this investigation. Numerous detection algorithms have problems handling series with sparse counts and numerous zeroes. A metric comparing preconditioning methods' treatment of low-count series would be a useful contribution. Finally, we note again that syndromic counts on holidays are dramatically different from other days. Since the number of holidays is too small to incorporate into our preconditioning methods, we suggest explicitly labeling them and removing them from consideration, in order to improve reliability. We strongly urge the use of some mechanism to account for holidays, as they are an explainable cause of significant variation.

From a theoretical detection perspective, the principal concern with any preconditioning technique must be its effect on the sensitivity and timeliness of alerting mechanisms. While proper data preconditioning should result in fewer false alerts due to non-outbreak variation, and a higher probability of detecting actual outbreaks, improper preconditioning can result in unexpected and even unreliable performance. It is therefore essential to study the effects of preconditioning using different methods and to assess the robustness of the results.

From an operational perspective, the main concern with data preconditioning techniques will be the presentation of alerts, and of the data streams causing the alerts, to the end user. A health monitor using an electronic syndromic surveillance system who receives an alarm will typically want to examine the raw data stream causing the alert. With preconditioning, the alerting data stream will potentially not resemble the original data stream, reducing the end user's trust in the system. This aspect of data visualization and user interface must be addressed from the end user's perspective.

Acknowledgements

We thank Howard Burkom of the Johns Hopkins University Applied Physics Laboratory for making the aggregated ED dataset, previously authorized by ESSENCE data providers for public use at the 2005 Syndromic Surveillance Conference Workshop, available to us.

For the first author, this research was performed under an appointment to the U.S. Department of Homeland Security (DHS) Scholarship and Fellowship Program, administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and DHS. ORISE is managed by Oak Ridge Associated Universities (ORAU) under DOE contract number DE-AC05-06OR23100. All opinions expressed in this paper are the author's and do not necessarily reflect the policies and views of DHS, DOE, or ORAU/ORISE.

References

1. Riffenburgh R, Cummins K: A simple and general change-point identifier. Statistics in Medicine 2006, 25(6):1067–1077.

2. Zhang J, Tsui F, Wagner M, Hogan W: Detection of outbreaks from time series data using wavelet transform. In AMIA Annual Symposium Proceedings 2003:748–752.

3. Moore A, Cooper G, Tsui R, Wagner M: Summary of Biosurveillance-relevant statistical and data mining technologies 2002.

4. Bradley CA, Rolka H, Walker D, Loonsk J: BioSense: Implementation of a National Early Event Detection and Situational Awareness System. MMWR 2006, 54(Suppl):11–19.

5. RODS Version 4.2 User Manual (rods.health.pitt.edu/RODS%204%202%20User%20Manual.pdf).

6. Hutwagner L, Thompson W, Seeman G, Treadwell T: The bioterrorism preparedness and response Early Aberration Reporting System (EARS). Journal of Urban Health 2003, 80(2) Suppl:89–96.

7. Marsden-Haug N, Foster V, Gould P, Elbert E, Wang H, Pavlin J: Code-based Syndromic Surveillance for Influenzalike Illness by International Classification of Diseases, Ninth Revision. Emerging Infectious Diseases 2007, 13(2), [http://www.cdc.gov/EID/content/13/2/207.htm].

8. Page ES: Continuous inspection schemes. Biometrika 1954, 41:100–115.

9. Reinke WA: Applicability of Industrial Sampling Techniques to Epidemiologic Investigations: Examination of an Underutilized Resource. American Journal of Epidemiology 1991.

10. Benneyan JC: Statistical quality control methods in infection control and hospital epidemiology, Part I: Introduction and basic theory. Infection Control and Hospital Epidemiology 1998, 19(3):194–214.

11. Box G, Luceno A: Statistical Control: By Monitoring and Feedback Adjustment. Wiley-Interscience, 1st edition 1997.

12. Shmueli G, Fienberg SE: Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance. In Statistical Methods in Counter-Terrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication, Springer 2006:109–140.

13. Gwilym S, Howard D, Davies N: Harry Potter casts a spell on accident prone children. The British Medical Journal 2005, 331:1505–1506.

14. Farrington C, Andrews N: Outbreak detection: application to infectious disease surveillance. In Monitoring the Health of Populations: Statistical Principles & Methods for Public Health Surveillance, Oxford University Press 2004.

15. Goldenberg A, Shmueli G, Caruana RA, Fienberg SE: Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences 2002, 99:5237–5240.

16. Tarnow-Mordi WO, Hau C, Warden A, Shearer AJ: Hospital mortality in relation to staff workload: a 4-year study in an adult intensive-care unit. The Lancet 2000.

17. Czaplinski C, Diers D: The effect of staff nursing on length of stay and mortality. Medical Care 1998.

18. Kovner C, Gergen PJ: Nurse staffing levels and adverse events following surgery in U.S. hospitals. Image J Nurs Sch 1998.

19. Blegen M, Vaughn T: A multisite study of nurse staffing and patient occurrences. Nursing Economics 1998.

20. Strzalka A, Havens D: Nursing care quality: comparison of unit-hired, hospital float pool, and agency nurses. J Nurs Care Qual 1996.

21. McCloskey JM: Nurse staffing and patient outcomes. Nursing Outlook 1998.

22. Archibald L, Manning M, Bell L, Banerjee S, Jarvis W: Patient density, nurse-to-patient ratio and nosocomial infection risk in a pediatric cardiac intensive care unit. Pediatric Infectious Disease Journal 1997.

23. Rice JA: Mathematical Statistics and Data Analysis, Second Edition. Duxbury Press 1995.

24. Serfling RE: Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Reports 1963, 78:494–506.

25. CDC: CDC Syndromic Surveillance site 2006, [http://www.cdc.gov/mmwr/pdf/wk/mm54su01.pdf].

26. Farrington C, Andrews N, Beale A, Catchpole M: A statistical algorithm for the early detection of outbreaks of infectious disease. Journal of the Royal Statistical Society, Series A (Statistics in Society) 1996, 159(3):547–563.

27. Brillman JC, Burr T, Forslund D, Joyce E, Picard R, Umland E: Modeling emergency department visit patterns for infectious disease complaints: results and application to disease surveillance. BMC Medical Informatics and Decision Making 2005, 5:4:1–14, [http://www.biomedcentral.com/content/pdf/1472-6947-5-4.pdf].

28. Burkom HS, Murphy SP, Shmueli G: Automated Time Series Forecasting for Biosurveillance. Statistics in Medicine, accepted 2007 (available at http://www3.interscience.wiley.com/cgi-bin/abstract/114131913/).

29. Kleinman K, Lazarus R, Platt R: A generalized linear mixed models approach for detecting incident clusters of disease in small areas, with an application to biological terrorism. Am J Epidemiol 2004, 159:217–224.

30. U.S. Census Bureau: U.S. Census Bureau X-12-ARIMA site 2006, [http://www.census.gov/srd/www/x12a/].

31. Brockwell PJ, Davis RA: Time Series: Theory and Methods, 2nd edition. Springer, New York 1987.

32. Muscatello D: An adjusted cumulative sum for count data with day-of-week effects: application to influenza-like illness. Presentation at the Syndromic Surveillance Conference 2004.

33. Chatfield C: The Holt-Winters Forecasting Procedure. Applied Statistics 1978, 27:264–279.

34. Holt CC: Forecasting seasonals and trends by exponentially weighted averages. Tech. rep., Carnegie Institute of Technology 1957.

35. Kulldorff M: Prospective time-periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A 2001, 164:61–72.

36. Burkom HS, Elbert Y, Feldman A, Lin J: Role of Data Aggregation in Biosurveillance Detection Strategies with Applications from ESSENCE. Morbidity and Mortality Weekly Report (MMWR) 2004, 53(suppl):67–73, [http://www.cdc.gov/mmwr/preview/mmwrhtml/su5301a16.htm].

Tables

Table 1: Comparison statistics for sales of throat lozenges, before ("raw data") and after preconditioning using different methods

                 raw data  regression  log regress  7dayDiff  7dayDiff holi  holt-winters
mean               909.12   -2.34e-13    -2.90e-15     17.60          10.15         10.19
stdev              349.65      132.40         0.13    189.49         171.54        116.06
weekendMean        962.88        0.64         0.01     20.68          11.43         25.50
percentInMin         0.01        0.02         0.02      0.02           0.02          0.04
pacfWeek             0.08       -0.01         0.00     -0.22          -0.33          0.07
acfWeek              0.85        0.22         0.21     -0.09          -0.02         -0.02
acfYear              0.28        0.07         0.12      0.05           0.03          0.04
daysHighPacf            6           1            2         8             50             1
skewness             0.25        0.77        -0.18      0.56          -0.17          0.14
excessKurtosis      -0.83        3.58         1.20      3.88           1.94          2.94

Table 2: Comparison statistics for gastrointestinal-related ED visits, before ("raw data") and after preconditioning using different methods

                 raw data  regression  log regress  7dayDiff  7dayDiff holi  holt-winters
mean               117.25    1.97e-14     3.31e-16      2.31          -0.04          0.96
stdev               65.69       27.42         0.28     36.53          30.76         30.10
weekendMean         30.03    1.28e-14     2.65e-16     -0.03          -0.03         -5.61
percentInMin         0.02        0.02         0.02      0.03           0.03          0.03
pacfWeek             0.79        0.19         0.11     -0.29          -0.38          0.15
acfWeek              0.35        0.28         0.15     -0.00          -0.03          0.05
acfYear              0.65        0.05         0.05      0.29           0.15          0.20
daysHighPacf           35           7            7         7             35            35
skewness            -0.19       -1.01        -2.99      1.34           0.06         -0.55
excessKurtosis      -1.15        5.21        17.43      8.34           5.38          3.68

Figures

Figure 1: Sample Shewhart Control Chart

Figure 2: CuSum chart applied to OTC sales data, before (top) and after (bottom) preconditioning

Figure 3: Plots for detecting explainable patterns and comparing preconditioning methods for sales of throat lozenges, for the raw data (top row) and after preconditioning using different methods.

Figure 4: Plots for detecting explainable patterns and comparing preconditioning methods for GI-related ED visits, for the raw data (top row) and after preconditioning using different methods.

Figure 5: Throat lozenge sales (black) and average temperature (red).

Figure 6: Residuals from a regression model for throat lozenge sales
