Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data
Rebecca Buchheit
AIS Lab
Background
• sporadic use of KDD techniques in civil infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under development
• KDD process highly domain dependent
• time consuming to teach data mining analysts domain knowledge
Research Objectives
• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
– set of guidelines for inexperienced analysts
– checklist for more experienced analysts
• describe intersection of KDD process characteristics and civil infrastructure
– what problems are well-suited to KDD?
– what characteristics are unique to infrastructure?
Summary
• increased data collection => increased need to intelligently analyze data
• KDD process as a “power tool” for analyzing data for high-level knowledge
• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
• proposed framework will help researchers to systematically apply KDD process to their data analysis problems
Data Quality
• What is it?
– in this talk, “accuracy”
– how close is the observed value to the true value?
– “ground truth” is rare
– look for anomalous patterns
• Why is it important?
– poor quality data may taint analyses
– patterns of poor quality data may overwhelm data mining/machine learning algorithms
Mn/ROAD Data
• weigh-in-motion data
– axle spacings and weights, speed, lane, error codes
• derived quantities
– equivalent standard axle loads (ESALs)
– FHWA vehicle type
– gross vehicle weight
– total vehicle length
• trucks only (type >= 4)
• Jan 1 ’98 to Dec 31 ’00
• about 3 million vehicles
courtesy Mn/ROAD
Sample Data
Overview of Approach
• use statistical analysis and data mining algorithms to separate anomalies from normal data
– clustering
– regression
– physical constraints
– statistical properties
• focus on differences between anomalies and normal data to help discover causation
Clustering
• group data into “natural classes”
• anomalies separated from normal data
• used Autoclass clustering algorithm
Clustering Results
Regression
• confidence interval of 95%
• R-square (fit) = 0.923
• if error > 15% then identify as anomaly
∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
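The fitted equation and the 15% rule could be applied roughly as follows. This is a minimal sketch, not the author's implementation: it uses only the central coefficient estimates (the ± ranges are ignored), and it assumes that "error" means relative error of the observed ∑ESAL against the predicted value. The slides do not say over what period the sums are taken or what units GVW is in, so the function and parameter names below are hypothetical.

```python
# Central coefficient estimates from the regression slide (± ranges ignored).
B_VEHICLES = 3.531
B_AXLES = -1.252
B_GVW = 0.066
INTERCEPT = -139.000
ERROR_THRESHOLD = 0.15  # "if error > 15% then identify as anomaly"


def predicted_esal_sum(n_vehicles, n_axles, gvw_sum):
    """Predicted sum of ESALs for one aggregation period."""
    return (B_VEHICLES * n_vehicles
            + B_AXLES * n_axles
            + B_GVW * gvw_sum
            + INTERCEPT)


def is_anomalous(observed_esal_sum, n_vehicles, n_axles, gvw_sum):
    """Flag the period if the relative error exceeds the threshold.

    Assumption: error = |observed - predicted| / |predicted|.
    """
    predicted = predicted_esal_sum(n_vehicles, n_axles, gvw_sum)
    if predicted == 0:
        # Relative error is undefined; flag only if the observation differs.
        return observed_esal_sum != 0
    error = abs(observed_esal_sum - predicted) / abs(predicted)
    return error > ERROR_THRESHOLD


if __name__ == "__main__":
    # Illustrative inputs only, not Mn/ROAD data.
    print(predicted_esal_sum(100, 200, 10000))
    print(is_anomalous(800.0, 100, 200, 10000))
```

Measuring the error against the observed value instead of the predicted one would be an equally plausible reading of the slide and would shift which periods are flagged near the threshold.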
Regression Results
Binary Constraints (1)
constraint                            # violations (3,068,384 total)
offscale hit error                    61,129 (1.99%)
significant weight difference error   11,107 (0.36%)
different axle counts error           69,521 (2.27%)
tailgating                            10,211 (0.33%)
speed >= 64.37 km/h                   51,114 (1.86%)
speed <= 128.74 km/h                  3,723 (0.12%)
Binary Constraints (2)
constraint                  # violations (3,068,384 total)
gross weight <= 45,359 kg   24,897 (0.81%)
length <= 22.86 m           79,454 (2.59%)
unknown vehicle type        190,191 (6.20%)
number of axles != 0        47 (0.00%)
number of axles <= 8        57,114 (1.86%)
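The numeric constraints in the two tables above can be checked per vehicle record along these lines. This is a sketch under stated assumptions: the thresholds come from the slides, but the `Vehicle` record and its field names are hypothetical, and the constraints that depend on weigh-in-motion error codes (offscale hit, weight difference, axle count mismatch, tailgating, unknown vehicle type) are left out because the slides do not give their encoding.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Vehicle:
    # Hypothetical record layout; the real Mn/ROAD schema is not shown.
    speed_kmh: float
    gross_weight_kg: float
    length_m: float
    n_axles: int


# Each entry states the condition a valid record is expected to satisfy.
CONSTRAINTS = {
    "speed >= 64.37 km/h": lambda v: v.speed_kmh >= 64.37,
    "speed <= 128.74 km/h": lambda v: v.speed_kmh <= 128.74,
    "gross weight <= 45,359 kg": lambda v: v.gross_weight_kg <= 45359,
    "length <= 22.86 m": lambda v: v.length_m <= 22.86,
    "number of axles != 0": lambda v: v.n_axles != 0,
    "number of axles <= 8": lambda v: v.n_axles <= 8,
}


def violations(vehicle):
    """Return the names of all constraints this record violates."""
    return [name for name, holds in CONSTRAINTS.items() if not holds(vehicle)]


def violation_counts(vehicles):
    """Count violations per constraint across many records."""
    counts = Counter()
    for v in vehicles:
        counts.update(violations(v))
    return counts


if __name__ == "__main__":
    # Illustrative records only, not Mn/ROAD data.
    sample = [Vehicle(95.0, 30000, 18.0, 5), Vehicle(50.0, 50000, 25.0, 9)]
    print(violation_counts(sample))
```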
Constraint Interactions
c1                  c2                  % interactions
slow speed          length over limit   63.5%
length over limit   slow speed          45.7%
tailgating          unknown type        31.7%
high speed          unknown type        28.7%
overweight          diff axle counts    25.2%
tailgating          slow speed          21.1%
tailgating          length over limit   15.2%
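The slides do not define "% interactions". One reading, assumed here, is directional: of the vehicles that violate c1, the percentage that also violate c2. That would explain why the same pair appears twice with different values. A minimal sketch under that assumption, with hypothetical names:

```python
def interaction_pct(violation_sets, c1, c2):
    """Percentage of records violating c1 that also violate c2.

    violation_sets: one set of violated-constraint names per vehicle.
    Assumption: "% interactions" is this conditional share.
    """
    n_c1 = sum(1 for s in violation_sets if c1 in s)
    if n_c1 == 0:
        return 0.0
    n_both = sum(1 for s in violation_sets if c1 in s and c2 in s)
    return 100.0 * n_both / n_c1


if __name__ == "__main__":
    # Illustrative records only, not Mn/ROAD data.
    records = [
        {"slow speed", "length over limit"},
        {"slow speed"},
        {"length over limit"},
        {"length over limit", "slow speed"},
        {"length over limit"},
    ]
    print(interaction_pct(records, "slow speed", "length over limit"))
    print(interaction_pct(records, "length over limit", "slow speed"))
```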
Distribution Constraints
• use a goodness-of-fit test to compare distributions from the same day of week
– length
– gross weight
– ESALs
– lane
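The slides do not name the goodness-of-fit test. As one possible illustration, the sketch below uses Pearson's chi-square statistic to compare one day's lane counts with baseline lane proportions for the same day of week. The function names, the two-lane example, and the 5% significance level are all assumptions; continuous quantities such as length, gross weight, and ESALs would need to be binned first.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_statistic(observed_counts, baseline_proportions):
    """Pearson chi-square statistic of observed counts against a baseline."""
    if len(observed_counts) != len(baseline_proportions):
        raise ValueError("need one baseline proportion per category")
    total = sum(observed_counts)
    stat = 0.0
    for observed, p in zip(observed_counts, baseline_proportions):
        if p <= 0:
            raise ValueError("baseline proportions must be positive")
        expected = total * p
        stat += (observed - expected) ** 2 / expected
    return stat


def departs_from_baseline(observed_counts, baseline_proportions):
    """True if the day is unlikely (at the 5% level) under the baseline."""
    dof = len(observed_counts) - 1
    stat = chi_square_statistic(observed_counts, baseline_proportions)
    return stat > CHI2_CRITICAL_05[dof]


if __name__ == "__main__":
    # Illustrative numbers only: baseline of 60% / 40% across two lanes.
    print(departs_from_baseline([600, 400], [0.6, 0.4]))
    print(departs_from_baseline([100, 900], [0.6, 0.4]))
```

With thousands of vehicles per day, even small shifts in proportions become statistically significant, so a stricter threshold or an effect-size measure may be more practical than the 5% level used here.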
Anomaly Identification
• identify days with higher than normal concentrations of binary constraint violations
• identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
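What counts as a "higher than normal" concentration is not specified in the slides. The sketch below shows one way to flag such days, assuming a robust cutoff of the median daily violation rate plus a multiple of the median absolute deviation (MAD). The cutoff rule, the multiplier `k`, and all names are assumptions rather than the method used in the talk.

```python
from statistics import median


def flag_high_violation_days(daily_rates, k=3.0):
    """Return the days whose violation rate is unusually high.

    daily_rates: mapping of day label -> fraction of that day's vehicles
    violating a given binary constraint.
    Assumed rule: rate > median + k * 1.4826 * MAD.
    """
    rates = list(daily_rates.values())
    center = median(rates)
    mad = median([abs(r - center) for r in rates])
    # 1.4826 scales the MAD to match a standard deviation for normal data.
    cutoff = center + k * 1.4826 * mad
    return sorted(day for day, rate in daily_rates.items() if rate > cutoff)


if __name__ == "__main__":
    # Illustrative rates only, not Mn/ROAD data.
    rates = {
        "1998-09-07": 0.020,
        "1998-09-08": 0.150,
        "1998-09-09": 0.021,
        "1998-09-10": 0.019,
        "1998-09-11": 0.020,
    }
    print(flag_high_violation_days(rates))
```

If more than half the days share exactly the same rate, the MAD is zero and every day above the median is flagged, so a minimum spread may be needed in practice.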
Binary Constraints Results
Distribution Constraints Results
A Quick Refresher
• used four different procedures to detect anomalies
– clustering
– regression
– binary (physical) constraints
– distribution constraints
• next up
– what is causing the anomalies?
– can we fix them?
Gross Vehicle Weight
Lane
What Happened?
• two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
• lightweight vehicles are tailgating cars
– cars not supposed to be in database
– mis-classified because of tailgating
– this causes the “high” vehicle counts
• very heavy vehicles are tailgating trucks
• lane 1 (right-hand side) data is missing for all “low” vehicle count days
Can It Be Fixed? (1)
• removed all tailgating cars
– lightweight
– short
– 2 or 3 axles
– error code
• “halved” all tailgating trucks
– very long
– very heavy
– more than 9 axles
– error code
Can It Be Fixed? (2)
• inserted lane 1 vehicles from same time period in 2000
• “shifted” days to make sure day of week was constant
– Tuesday Sept 8 1998 => Tuesday Sept 5 2000
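The day-of-week shift in the example above is consistent with moving each date by a whole number of weeks: Sept 8 1998 plus 104 weeks is Sept 5 2000. A minimal sketch of that idea follows. The function name is hypothetical, and rounding to the nearest whole week is an assumption about how the matching day was chosen.

```python
from datetime import date, timedelta


def shift_to_year(d, target_year):
    """Move a date to the target year while keeping its day of week.

    Shifts by the whole number of weeks that lands closest to the same
    calendar date in the target year (an assumed rule).
    """
    try:
        anchor = d.replace(year=target_year)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        anchor = d.replace(year=target_year, day=28)
    weeks = round((anchor - d).days / 7)
    return d + timedelta(weeks=weeks)


if __name__ == "__main__":
    shifted = shift_to_year(date(1998, 9, 8), 2000)
    print(shifted, shifted.strftime("%A"))
```

For dates within a few days of a year boundary, the shifted date can fall just outside the target year, which would need separate handling.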
Summary
• statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
– focus on differences between anomalies and normal data to narrow down possible causes
– need domain knowledge to understand causation
Current Progress/Future Work
• integrate algorithms into data quality assessment program == automation
– physical constraints
– distribution constraints
– other statistical characteristics of data
– clustering
– regression, neural networks
• will support infrastructure-related data collection activities
• use algorithms to identify and “clean” anomalies
Acknowledgements
• Minnesota Department of Transportation, especially Maggi Chalkline
• based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380