Practical Aspects of Alerting Algorithms in Biosurveillance
Howard S. Burkom
The Johns Hopkins University Applied Physics Laboratory
National Security Technology Department
Biosurveillance Information Exchange Working Group
DIMACS Program/Rutgers University
Piscataway, NJ, February 22, 2006
Outline
What information do temporal alerting algorithms give the health monitor?
How can typical data issues introduce bias or other misinformation?
How do spatial scan statistics and other spatiotemporal methods give the monitor a different look at the data?
What data issues are important for the quality of this information?
Conceptual approaches to Aberration Detection
What does ‘aberration’ mean? Different approaches for a single data source:
• Process control-based: “The underlying data distribution has changed” – many measures
• Model-based: “The data do not fit an analytical model based on a historical baseline” – many models
• Can combine these approaches
• Spatiotemporal Approach: “The relationship of local data to neighboring data differs from expectations based on model or recent history”
Comparing Alerting Algorithms
Criteria:
• Sensitivity
  – Probability of detecting an outbreak signal
  – Depends on effect of outbreak in data
• Specificity (1 – false alert rate)
  – Probability(no alert | no outbreak)
  – May be difficult to prove no outbreak exists
• Timeliness
  – Once the effects of an outbreak appear in the data, how soon is an alert expected?
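A minimal sketch of how these three criteria might be scored on labeled test series (for example, series with injected outbreak signals). The function names `score_run` and `sensitivity` are hypothetical, and the sketch assumes each series has at least one outbreak day and one non-outbreak day.

```python
from typing import Optional, Sequence, Tuple


def score_run(alerts: Sequence[bool],
              outbreak: Sequence[bool]) -> Tuple[Optional[int], float]:
    """Score one labeled daily series.

    Returns (delay, specificity):
      delay        days from the first outbreak day to the first alert that
                   falls on an outbreak day, or None if the outbreak is missed
      specificity  fraction of non-outbreak days with no alert
    """
    outbreak_days = [i for i, flag in enumerate(outbreak) if flag]
    quiet_days = [i for i, flag in enumerate(outbreak) if not flag]
    hits = [i for i in outbreak_days if alerts[i]]
    delay = hits[0] - outbreak_days[0] if hits else None
    quiet_without_alert = sum(1 for i in quiet_days if not alerts[i])
    specificity = quiet_without_alert / len(quiet_days)
    return delay, specificity


def sensitivity(delays: Sequence[Optional[int]]) -> float:
    """Fraction of test runs in which the outbreak was detected at all."""
    return sum(d is not None for d in delays) / len(delays)
```

Timeliness is then summarized from the non-None delays across runs, for instance by their median.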
Aggregating Data in Time
Data stream(s) to monitor in time:
baseline interval
• Used to get some estimate of normal data behavior
  – Mean, variance
  – Regression coefficients
  – Expected covariate distribution (spatial, age category, % of claims/syndrome)
guardband
• Avoids contamination of baseline with outbreak signal
test interval
• Counts to be tested for anomaly
• Nominally 1 day
• Longer to reduce noise, test for epicurve shape
• Will shorten as data acquisition improves
Elements of an Alerting Algorithm
– Values to be tested: raw data, or residuals from a model?
– Baseline period
  • Historical data used to determine expected data behavior
  • Fixed or a sliding window?
  • Outlier removal: to avoid training on unrepresentative data
  • What does algorithm do when there is all zero/no baseline data?
  • Is a warmup period of data history required?
– Buffer period (or guardband)
  • Separation between the baseline period and interval to be tested
– Test period
  • Interval of current data to be tested
– Reset criterion
  • To prevent flooding by persistent alerts caused by extreme values
– Test statistic: value computed to make alerting decisions
– Threshold: alert issued if test statistic exceeds this value
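As an illustration of how these elements fit together, here is a minimal sketch of a sliding-baseline z-score with a guardband and a 1-day test period. The function name `sliding_zscores`, the default window lengths, and the `min_sd` floor (one possible answer to the all-zero baseline question) are assumptions for illustration, not values taken from the slides.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def sliding_zscores(counts: Sequence[float],
                    baseline_len: int = 28,
                    guardband: int = 2,
                    min_sd: float = 0.5) -> List[Optional[float]]:
    """Test statistic for each day t: (count_t - baseline mean) / baseline sd.

    The baseline is the `baseline_len` days ending `guardband` days before
    the tested day, so a recent outbreak signal does not contaminate it.
    Days without enough history (the warmup period) get None.
    """
    stats: List[Optional[float]] = []
    for t in range(len(counts)):
        end = t - guardband          # exclusive end of the baseline window
        start = end - baseline_len   # inclusive start of the baseline window
        if start < 0:
            stats.append(None)       # still in the warmup period
            continue
        baseline = counts[start:end]
        mu = mean(baseline)
        # Assumed floor: avoids dividing by zero on a constant baseline.
        sd = max(stdev(baseline), min_sd)
        stats.append((counts[t] - mu) / sd)
    return stats


def alerts(stats: Sequence[Optional[float]], threshold: float) -> List[bool]:
    """Alert on days where the test statistic exceeds the threshold."""
    return [s is not None and s > threshold for s in stats]
```

Outlier removal and a reset criterion are left out of this sketch; both would change which days enter the baseline.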
Rash Syndrome Grouping of Diagnosis Codes
www.bt.cdc.gov/surveillance/syndromedef/word/syndromedefinitions.doc
Rash ICD-9-CM Code List

ICD9CM  ICD9DESCR                                Consensus
050.0   SMALL POX, VARIOLA MAJOR                 1
050.1   SMALL POX, ALASTRIM                      1
050.2   SMALL POX, MODIFIED                      1
050.9   SMALLPOX NOS                             1
051.0   COWPOX                                   1
051.1   PSEUDOCOWPOX                             1
052.7   VARICELLA COMPLICAT NEC                  1
052.8   VARICELLA W/UNSPECIFIED C                1
052.9   VARICELLA NOS                            1
057.8   EXANTHEMATA VIRAL OTHER S                1
057.9   EXANTHEM VIRAL, UNSPECIFI                1
695.0   ERYTHEMA TOXIC                           1
695.1   ERYTHEMA MULTIFORME                      1
695.2   ERYTHEMA NODOSUM                         1
695.89  ERYTHEMATOUS CONDITIONS O                1
695.9   ERYTHEMATOUS CONDITION N                 1
692.9   DERMATITIS UNSPECIFIED CA                2
782.1   RASH/OTHER NONSPEC SKIN E                2
026.0   SPIRILLARY FEVER                         3
026.1   STREPTOBACILLARY FEVER                   3
026.9   RAT-BITE FEVER UNSPECIFIED               3
051.2   DERMATITIS PUSTULAR, CONT                3
051.9   PARAVACCINIA NOS                         3
053.20  HERPES ZOSTER DERMATITIS E               3
053.79  HERPES ZOSTER WITH OTHER SPECIF COMPLIC  3
053.8   H.Z. W/ UNSPEC. COMPLICATION             3
053.9   HERPES ZOSTER NOS W/O COM                3
054.0   ECZEMA HERPETICUM                        3
054.79  HERPES SIMPLEX W/OTH.SPEC                3
054.8   HERPES SIMPLEX, W/UNS.COM                3
054.9   HERPES SIMPLEX NOS                       3
055.79  MEASLES COMPLICATION NEC                 3
055.8   MEASLES COMPLICATION NOS                 3
055.9   MEASLES UNCOMPLICATED                    3
056.79  RUBELLA COMPLICATION NEC                 3
056.8   RUBELLA COMPLICATION NOS                 3
056.9   RUBELLA UNCOMPLICATED                    3
057.0   ERYTHEMIA INFECT.(5TH DIS                3
074.3   HAND/FOOT AND MOUTH DISEA                3
078.0   MOLLUSCUM CONTAGIOSUM                    3
082.0   ROCKY MOUNTAIN SPOTTED FE                3
083.2   RICKETTSIALPOX                           3
695.3   ROSACEA                                  3
695.4   LUPUS ERYTHEMATOSUS                      3
Example: Daily Counts with Injected Cases
[Figure: daily Rash_1 syndrome counts (0 to 14) by encounter date, 9/22/96 to 12/21/96. Legend: Rash_1, expected, event-attributable. Annotation: injected cases presumed attributable to outbreak event.]
Example: Algorithm Alerts Indicated
[Figure: daily Rash_1 syndrome counts (0 to 14) by encounter date, 9/22/96 to 12/21/96. Legend: Rash_1, expected, alert, event-attributable. Annotation: test statistic exceeds chosen threshold.]
EWMA Monitoring
• Exponential Weighted Moving Average
• Average with most weight on recent $X_k$:
  $S_k = \omega S_{k-1} + (1 - \omega) X_k$, where $0 < \omega < 1$
• Test statistic: $S_k$ compared to expectation from sliding baseline
• Basic idea: monitor $(S_k - \mu_k) / \sigma_k$
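A minimal sketch of the EWMA recursion and the standardized statistic. The weight on the previous smoothed value is called `omega` here, and the sketch assumes the expectation and standard deviation come from a sliding baseline of raw counts separated from the test day by a guardband. The window lengths, the `min_sd` floor, and the choice to start the recursion at the first count are illustrative assumptions.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def ewma(xs: Sequence[float], omega: float = 0.6) -> List[float]:
    """Smoothed series S_k = omega * S_{k-1} + (1 - omega) * X_k.

    The recursion is started at the first observation (an assumption).
    """
    s = float(xs[0])
    smoothed = []
    for x in xs:
        s = omega * s + (1 - omega) * x
        smoothed.append(s)
    return smoothed


def ewma_statistic(xs: Sequence[float],
                   omega: float = 0.6,
                   baseline_len: int = 7,
                   guardband: int = 2,
                   min_sd: float = 0.5) -> List[Optional[float]]:
    """(S_k - mu_k) / sigma_k, with mu_k and sigma_k from a sliding baseline."""
    smoothed = ewma(xs, omega)
    stats: List[Optional[float]] = []
    for k in range(len(xs)):
        end = k - guardband
        start = end - baseline_len
        if start < 0:
            stats.append(None)  # not enough history yet
            continue
        baseline = xs[start:end]
        mu = mean(baseline)
        sd = max(stdev(baseline), min_sd)  # assumed floor for flat baselines
        stats.append((smoothed[k] - mu) / sd)
    return stats
```

An alert would be issued on days where the statistic exceeds a chosen threshold.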
Exponential Weighted Moving Average
[Figure: daily count and smoothed series (0 to 60), 02/25/94 to 04/01/94]
• Added sensitivity for gradual events
• Putting more weight on the most recent count means less smoothing
Example with Detection Statistic Plot
[Figure: detection statistic plotted against the threshold, marking where the statistic exceeds the threshold]
Example: EWMA applied to Rash Data
Effects of Data Problems
[Figure annotations: missed event; additional flags]
Importance of spatial data for biosurveillance
– Purely temporal methods can find anomalies, IF you know which case counts to monitor
  • Location of outbreak?
  • Extent?
– Advantages of spatial clustering
  • Tracking progression of outbreak
  • Identifying population at risk
Evaluating Candidate Clusters
[Figure: case locations (x) in the surveillance region, with a candidate cluster outlined]
Candidate cluster: the scan statistic gives a measure of “how unlikely is the number of cases inside relative to the number outside, given the expected spatial distribution of cases”
(Thus, a populous region won’t necessarily flag.)
Selecting Candidate Clusters
Searching for Spatial Clustering
• form cylinders: bases are circles about each centroid in region A, height is time
• calculate statistic for event count in each cylinder relative to entire region, within space & time limits
• most significant clusters: regions whose centroids form base of cylinder with maximum statistic
• but how unusual is it? Repeat procedure with Monte Carlo runs, compare max statistic to maxima of each of these
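The search above can be sketched in a simplified, purely spatial form (circles only, without the time dimension of the cylinders). This sketch assumes a Poisson log likelihood ratio of the kind used in Kulldorff-style scan statistics and a multinomial Monte Carlo null; the function names and the `max_size` limit on regions per circle are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def llr(c: float, e: float, total: float) -> float:
    """Poisson log likelihood ratio for c cases inside vs. e expected inside."""
    if c <= e:
        return 0.0  # only excesses inside the candidate cluster count
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def candidate_zones(coords: Sequence[Tuple[float, float]],
                    max_size: int) -> List[Tuple[int, ...]]:
    """Circles about each centroid, grown to take in the nearest regions."""
    zones = []
    for cx, cy in coords:
        order = sorted(range(len(coords)),
                       key=lambda i: (coords[i][0] - cx) ** 2 + (coords[i][1] - cy) ** 2)
        for size in range(1, max_size + 1):
            zones.append(tuple(order[:size]))
    return zones


def max_llr(cases, expected, zones):
    """Return (maximum statistic, zone attaining it)."""
    total = sum(cases)
    best = (0.0, ())
    for zone in zones:
        c = sum(cases[i] for i in zone)
        e = sum(expected[i] for i in zone)
        value = llr(c, e, total)
        if value > best[0]:
            best = (value, zone)
    return best


def scan(cases, expected, coords, max_size, n_sim=199, seed=0):
    """Most likely cluster, its statistic, and a Monte Carlo p-value."""
    rng = random.Random(seed)
    total = sum(cases)
    scale = total / sum(expected)
    exp = [e * scale for e in expected]  # expected counts summing to total
    zones = candidate_zones(coords, max_size)
    observed, zone = max_llr(cases, exp, zones)
    exceed = 0
    for _ in range(n_sim):
        # Null hypothesis: cases fall in regions in proportion to expectation.
        sim = [0] * len(cases)
        for i in rng.choices(range(len(cases)), weights=exp, k=total):
            sim[i] += 1
        if max_llr(sim, exp, zones)[0] >= observed:
            exceed += 1
    return zone, observed, (exceed + 1) / (n_sim + 1)
```

Comparing the observed maximum with the maxima of the simulated runs is what controls for testing many candidate circles at once.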
[Figure: centroids of data collection regions (x), with circles drawn about the centroid of region A]
Scan Statistic Demo
Scan Statistics: Advantages
• Gives monitor guidance for cluster size, location, significance
• Avoids preselection bias regarding cluster size or location
• Significance testing has control for multiple testing
• Can tailor problem design by data, objective:
  – Location (zipcode, hospital/provider site, patient/customer residence, school/store address)
  – Time windows used (cases, history, guardband)
  – Background estimation method: model, history, population, eligible customers
Surveillance Application
OTC Anti-flu Sales, Dates: 15-24 Apr 2002
Total sales as of 25 Apr: 1804
Potential cluster: center at 22311
• 63 sales, 39 expected from recent data
• relative risk = 1.6
• p = 0.041
Distribution of Nonsyndromic Visits
4 San Diego Hospitals
Effect of Data Discontinuities on OTC Cough/Cold Clusters
• Before removing problem zips, cluster groups are dominated by zips that “turn on” after sustained periods of zero or abnormally low counts.
• After editing, more interesting cluster groups emerge.
[Figure: counts by day and zip code (zips ordered south to north)]
School Nurse Data: All Visits
unreported
Cluster Investigation by Record Inspection
Records Corresponding to a Respiratory Cluster
Backups
Cumulative Summation Approach (CUSUM)
• Widely adapted to disease surveillance
• Devised for prompt detection of small shifts
• Look for changes of 2k standard deviations from the mean (often k = 0.5)
• Take normalized deviation: often $Z_t = (x_t - \mu) / \sigma$
• Compare lower, upper sums to threshold h:
  $S_{H,t} = \max(0, (Z_t - k) + S_{H,t-1})$
  $S_{L,t} = \max(0, (-Z_t - k) + S_{L,t-1})$
• Phase I sets h and k
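A minimal sketch of the two-sided CUSUM recursion with a fixed mean and standard deviation. The function name `cusum` is hypothetical; `k = 0.5` follows the slide, while the threshold `h` has no default here because the slide leaves it to Phase I.

```python
from typing import List, Sequence, Tuple


def cusum(xs: Sequence[float], mu: float, sigma: float,
          h: float, k: float = 0.5) -> List[Tuple[float, float, bool]]:
    """Return (upper sum, lower sum, alert) for each observation.

    Each sum accumulates normalized deviations beyond the allowance k and is
    floored at zero, so small fluctuations do not build up.
    """
    upper = 0.0
    lower = 0.0
    results = []
    for x in xs:
        z = (x - mu) / sigma                 # normalized deviation
        upper = max(0.0, (z - k) + upper)    # tracks increases
        lower = max(0.0, (-z - k) + lower)   # tracks decreases
        results.append((upper, lower, upper > h or lower > h))
    return results
```

For example, with `mu = 10`, `sigma = 2`, and `h = 1`, a sustained shift to counts of 12 (one standard deviation up) adds 0.5 to the upper sum each day, so the alert comes on the third day of the shift.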
ER Respiratory Claim Data
[Figure: number of cases (0 to 70) by date, 12/30 to 2/28 (2000-2001). Legend: data, smoothed, SH > 1, SL > 1.]
Upper Sum: Keep adding differences between today’s count and k std deviations above mean.
Alert when the sum exceeds threshold h.
CuSum Example: CDC EARS Methods C1-C3
Three adaptive methods chosen by National Center for Infectious Diseases after 9/1/2001 as most consistent
• Look for aberrations representing increases, not decreases
• Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
[Figure: timeline from Day -9 to Day 0 (current count)]
• Baseline for C1-MILD: days -7 to -1
• Baseline for C2-MEDIUM: days -9 to -3
• Baseline for C3-ULTRA: days -9 to -3
Calculation for C1-C3:
Individual day statistic for day j with lag n:
$S_{j,n} = \max\{0, (\mathrm{Count}_j - [\mu_n + \sigma_n]) / \sigma_n\}$, where
$\mu_n$ is the 7-day average with n-day lag (so $\mu_3$ is the mean of counts on days j-9 through j-3), and
$\sigma_n$ is the standard deviation of the same 7-day window
C1 statistic for day k is $S_{k,1}$ (no lag between baseline and current day)
C2 statistic for day k is $S_{k,3}$ (2-day lag between baseline and current day)
C3 statistic for day k is $S_{k,3} + S_{k-1,3} + S_{k-2,3}$, where $S_{k-1,3}$ and $S_{k-2,3}$ are added if they do not exceed the threshold
Upper bound threshold of 2: equivalent to 3 standard deviations above mean, since $S_{j,n} > 2$ exactly when $\mathrm{Count}_j > \mu_n + 3\sigma_n$
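A minimal sketch of the C1-C3 calculation as described on this slide. The function names and the `min_sd` floor (to avoid dividing by zero when the 7-day baseline is constant) are assumptions for illustration, not part of the EARS definition given here.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def day_stat(counts: Sequence[float], j: int, lag: int,
             window: int = 7, min_sd: float = 0.2) -> Optional[float]:
    """S_{j,n}: baseline is the `window` days ending `lag` days before day j."""
    start = j - lag - (window - 1)
    if start < 0:
        return None  # not enough history for a full baseline
    baseline = counts[start:j - lag + 1]
    mu = mean(baseline)
    sd = max(stdev(baseline), min_sd)  # assumed floor, see lead-in
    return max(0.0, (counts[j] - (mu + sd)) / sd)


def c1(counts: Sequence[float], k: int) -> Optional[float]:
    """Baseline days k-7 to k-1."""
    return day_stat(counts, k, lag=1)


def c2(counts: Sequence[float], k: int) -> Optional[float]:
    """Baseline days k-9 to k-3."""
    return day_stat(counts, k, lag=3)


def c3(counts: Sequence[float], k: int, threshold: float = 2.0) -> Optional[float]:
    """C2 for day k plus the C2 values of the two previous days,
    each added only if it did not itself exceed the threshold."""
    total = day_stat(counts, k, lag=3)
    if total is None:
        return None
    for previous_day in (k - 1, k - 2):
        s = day_stat(counts, previous_day, lag=3)
        if s is not None and s <= threshold:
            total += s
    return total
```

With a baseline mean of 4 and standard deviation of 1, a count of 8 gives a statistic of 3, which exceeds the threshold of 2 because 8 is more than 3 standard deviations above the mean.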
Detailed Example, I
Fewer alerts AND more sensitive: why?
Detailed Example, II
Signal Detected only with 28-day baseline
Detailed Example, III: “the rest of the story”