Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data Rebecca Buchheit AIS Lab


Background

• sporadic use of KDD techniques in civil infrastructure

• relative youth of data mining research

• difficult to systematically apply KDD process

• KDD process tools (CRISP-DM) still under development

• KDD process highly domain dependent

• time consuming to teach data mining analysts domain knowledge

Research Objectives

• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
  – set of guidelines for inexperienced analysts
  – checklist for more experienced analysts

• describe intersection of KDD process characteristics and civil infrastructure
  – what problems are well-suited to KDD?
  – what characteristics are unique to infrastructure?

Summary

• increased data collection => increased need to intelligently analyze data

• KDD process as a “power tool” for analyzing data for high-level knowledge

• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results

• proposed framework will help researchers to systematically apply KDD process to their data analysis problems

Data Quality

• What is it?
  – in this talk, “accuracy”
  – how close is the observed value to the true value?
  – “ground truth” is rare
  – look for anomalous patterns

• Why is it important?
  – poor quality data may taint analyses
  – patterns of poor quality data may overwhelm data mining/machine learning algorithms

Mn/ROAD Data

• weigh-in-motion data
  – axle spacings and weights, speed, lane, error codes

• derived quantities
  – equivalent standard axle loads (ESALs)
  – FHWA vehicle type
  – gross vehicle weight
  – total vehicle length

• trucks only (type >= 4)

• Jan 1 ’98 to Dec 31 ’00

• about 3 million vehicles

courtesy Mn/ROAD

Sample Data

Overview of Approach

• use statistical analysis and data mining algorithms to separate anomalies from normal data
  – clustering
  – regression
  – physical constraints
  – statistical properties

• focus on differences between anomalies and normal data to help discover causation

Clustering

• group data into “natural classes”

• anomalies separated from normal data

• used Autoclass clustering algorithm
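The idea of letting a clustering step isolate anomalies can be sketched with a much simpler algorithm. The example below is not Autoclass; it uses plain one-dimensional k-means as a stand-in, and it assumes (this is not stated on the slide) that clusters holding a very small share of the data are the candidate anomalies. The function names, the starting centers and the 5% share threshold are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=20):
    """Plain 1-D k-means (a stand-in for Autoclass, not the same algorithm).

    Returns the final centers and one cluster label per value.
    """
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Move each center to the mean of its members (unchanged if empty).
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


def small_clusters(labels, max_share=0.05):
    """Labels of clusters holding at most max_share of the data.

    Assumption: rare clusters are treated as candidate anomalies.
    """
    n = len(labels)
    return sorted(lab for lab in set(labels) if labels.count(lab) / n <= max_share)


# Hypothetical vehicle lengths in metres: mostly ordinary, two extreme records.
lengths = [20.0] * 50 + [21.0] * 48 + [95.0] * 2
centers, labels = kmeans_1d(lengths, centers=[20.0, 100.0])
suspect = small_clusters(labels)
```

Inspecting what distinguishes the members of a suspect cluster from the rest is then the manual, domain-driven step.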

Clustering Results

Regression

• confidence interval of 95%

• R-square (fit) = 0.923

• if error > 15% then identify as anomaly

∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
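A minimal sketch of how the regression rule might be applied, using only the central coefficient estimates from the equation above. It assumes that the sums are taken over one aggregation period, that "error" means the relative difference between observed and predicted ESALs, and that GVW is in the same units used when fitting (the slide does not give them). The function names are hypothetical.

```python
# Central estimates from the fitted regression (uncertainties omitted).
B_VEHICLES = 3.531
B_AXLES = -1.252
B_GVW = 0.066
INTERCEPT = -139.000


def predicted_esal(n_vehicles, n_axles, total_gvw):
    """Predicted total ESALs for one aggregation period."""
    return (B_VEHICLES * n_vehicles + B_AXLES * n_axles
            + B_GVW * total_gvw + INTERCEPT)


def is_anomaly(observed_esal, n_vehicles, n_axles, total_gvw, tolerance=0.15):
    """Flag the period when the relative error exceeds the tolerance.

    Assumption: error is measured relative to the predicted value.
    A prediction of exactly zero is not handled in this sketch.
    """
    predicted = predicted_esal(n_vehicles, n_axles, total_gvw)
    relative_error = abs(observed_esal - predicted) / abs(predicted)
    return relative_error > tolerance


# Hypothetical period: 1,000 vehicles, 5,000 axles, summed GVW of 100,000.
flag = is_anomaly(5000.0, 1000, 5000, 100000.0)
```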

Regression Results

Binary Constraints (1)

constraint                             # violations (3,068,384 total)
offscale hit error                     61,129 (1.99%)
significant weight difference error    11,107 (0.36%)
different axle counts error            69,521 (2.27%)
tailgating                             10,211 (0.33%)
speed >= 64.37 km/h                    51,114 (1.86%)
speed <= 128.74 km/h                   3,723 (0.12%)

Binary Constraints (2)

constraint                   # violations (3,068,384 total)
gross weight <= 45,359 kg    24,897 (0.81%)
length <= 22.86 m            79,454 (2.59%)
unknown vehicle type         190,191 (6.20%)
number of axles != 0         47 (0.00%)
number of axles <= 8         57,114 (1.86%)
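The numeric constraints in the two tables can be sketched as simple predicates over a vehicle record. The thresholds are the ones listed; the record layout and all names are hypothetical. The error-code, tailgating and unknown-type checks are left out because the slides do not show how those flags are encoded.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    # Hypothetical record layout; field names are not from the source data.
    speed_kmh: float
    gross_weight_kg: float
    length_m: float
    n_axles: int


# (name, predicate that must hold for a plausible record)
CONSTRAINTS = [
    ("speed >= 64.37 km/h", lambda v: v.speed_kmh >= 64.37),
    ("speed <= 128.74 km/h", lambda v: v.speed_kmh <= 128.74),
    ("gross weight <= 45,359 kg", lambda v: v.gross_weight_kg <= 45359),
    ("length <= 22.86 m", lambda v: v.length_m <= 22.86),
    ("number of axles != 0", lambda v: v.n_axles != 0),
    ("number of axles <= 8", lambda v: v.n_axles <= 8),
]


def violations(vehicle):
    """Names of every constraint the record fails."""
    return [name for name, holds in CONSTRAINTS if not holds(vehicle)]


def count_violations(vehicles):
    """Per-constraint violation counts over a collection of records."""
    counts = {name: 0 for name, _ in CONSTRAINTS}
    for v in vehicles:
        for name in violations(v):
            counts[name] += 1
    return counts
```

Keeping each constraint as a named predicate makes it straightforward to report counts per constraint, as in the tables, and to record which constraints a given vehicle violates together.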

Constraint Interactions

c1                   c2                   % interactions
slow speed           length over limit    63.5%
length over limit    slow speed           45.7%
tailgating           unknown type         31.7%
high speed           unknown type         28.7%
overweight           diff axle counts     25.2%
tailgating           slow speed           21.1%
tailgating           length over limit    15.2%
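One plausible reading of "% interactions", consistent with the table being asymmetric in c1 and c2, is the share of records violating c1 that also violate c2. The sketch below computes that conditional percentage; the definition is an assumption, and the function and variable names are hypothetical.

```python
def interaction_pct(violation_sets, c1, c2):
    """Percent of records violating c1 that also violate c2.

    violation_sets: one set of violated-constraint names per record.
    Assumption: this conditional share is what "% interactions" means.
    """
    with_c1 = [s for s in violation_sets if c1 in s]
    if not with_c1:
        return 0.0
    both = sum(1 for s in with_c1 if c2 in s)
    return 100.0 * both / len(with_c1)


# Hypothetical records, each described by the constraints it violates.
records = [
    {"slow speed", "length over limit"},
    {"slow speed"},
    {"length over limit"},
    {"length over limit"},
    {"slow speed", "length over limit"},
    set(),
]
slow_then_long = interaction_pct(records, "slow speed", "length over limit")
long_then_slow = interaction_pct(records, "length over limit", "slow speed")
```

Because the denominator changes with c1, swapping the two constraints generally gives a different percentage, as in the first two rows of the table.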

Distribution Constraints

• use a goodness-of-fit test to compare distributions from the same day of week
  – length
  – gross weight
  – ESALs
  – lane
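The slide does not name the goodness-of-fit test. As one possible choice, the sketch below uses Pearson's chi-square statistic to compare one day's binned counts (for example, vehicles per lane) against baseline counts pooled from the same day of the week. The function names and the sample counts are hypothetical, and the critical value is supplied by the caller.

```python
def chi_square_statistic(observed_counts, baseline_counts):
    """Pearson chi-square statistic of one day's counts against a baseline.

    Expected counts are the baseline proportions scaled to the day's total.
    Assumes every baseline bin is non-empty.
    """
    total_obs = sum(observed_counts)
    total_base = sum(baseline_counts)
    stat = 0.0
    for obs, base in zip(observed_counts, baseline_counts):
        expected = total_obs * base / total_base
        stat += (obs - expected) ** 2 / expected
    return stat


def day_is_anomalous(observed_counts, baseline_counts, critical_value):
    """True when the day is unlikely to come from the baseline distribution."""
    return chi_square_statistic(observed_counts, baseline_counts) > critical_value


# Hypothetical four-bin example; 7.815 is the chi-square critical value
# for 3 degrees of freedom at the 5% level.
baseline = [400, 300, 200, 100]
flag = day_is_anomalous([0, 50, 30, 20], baseline, critical_value=7.815)
```

With very large daily counts, even small departures from the baseline become statistically significant, so the threshold may need tuning in practice.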

Anomaly Identification

• identify days with higher than normal concentrations of binary constraint violations

• identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
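For the first rule, "higher than normal" needs a threshold, which the slide does not specify. A minimal sketch, assuming a day is flagged when its violation rate exceeds the mean rate by more than k sample standard deviations (k = 2 here is an arbitrary, hypothetical choice):

```python
from statistics import mean, stdev


def flag_days(daily_violation_rates, k=2.0):
    """Days whose violation rate exceeds mean + k * stdev over all days.

    daily_violation_rates: mapping of day label -> fraction of that day's
    records violating a given binary constraint. Needs at least two days.
    """
    rates = list(daily_violation_rates.values())
    threshold = mean(rates) + k * stdev(rates)
    return sorted(day for day, rate in daily_violation_rates.items()
                  if rate > threshold)


# Hypothetical rates: ten ordinary days and one day with many violations.
rates = {f"day{i:02d}": 0.02 for i in range(10)}
rates["day10"] = 0.30
flagged = flag_days(rates)
```

A median-based threshold would be less sensitive to the anomalous days themselves inflating the spread.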

Binary Constraints Results

Distribution Constraints Results

A Quick Refresher

• used four different procedures to detect anomalies
  – clustering
  – regression
  – binary (physical) constraints
  – distribution constraints

• next up
  – what is causing the anomalies?
  – can we fix them?

Gross Vehicle Weight

Lane

What Happened?

• two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle

• lightweight vehicles are tailgating cars
  – cars not supposed to be in database
  – mis-classified because of tailgating
  – this causes the “high” vehicle counts

• very heavy vehicles are tailgating trucks

• lane 1 (right-hand side) data is missing for all “low” vehicle count days

Can It Be Fixed? (1)

• removed all tailgating cars
  – lightweight
  – short
  – 2 or 3 axles
  – error code

• “halved” all tailgating trucks
  – very long
  – very heavy
  – more than 9 axles
  – error code
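A rough sketch of the two cleaning rules above. The slides give no numeric cut-offs for "lightweight", "short", "very long" or "very heavy", so these are passed in as hypothetical parameters. "Halved" is interpreted here as replacing one merged record with two records of half the weight, length and axle count; that interpretation, the record layout and all names are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout.
    gross_weight_kg: float
    length_m: float
    n_axles: int
    has_error_code: bool


def clean(records, light_kg, short_m, heavy_kg, long_m):
    """Drop suspected tailgating cars and split suspected tailgating trucks."""
    cleaned = []
    for r in records:
        if not r.has_error_code:
            cleaned.append(r)
            continue
        is_car = (r.gross_weight_kg < light_kg and r.length_m < short_m
                  and r.n_axles in (2, 3))
        is_truck_pair = (r.gross_weight_kg > heavy_kg and r.length_m > long_m
                         and r.n_axles > 9)
        if is_car:
            continue  # cars do not belong in the trucks-only data
        if is_truck_pair:
            # Assumed meaning of "halved": two equal vehicles. An odd axle
            # count loses one axle under integer division.
            half = replace(r,
                           gross_weight_kg=r.gross_weight_kg / 2,
                           length_m=r.length_m / 2,
                           n_axles=r.n_axles // 2,
                           has_error_code=False)
            cleaned.extend([half, half])
        else:
            cleaned.append(r)
    return cleaned
```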

Can It Be Fixed? (2)

• inserted lane 1 vehicles from same time period in 2000

• “shifted” days to make sure day of week was constant
  – Tuesday Sept 8 1998 => Tuesday Sept 5 2000
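One way to reproduce the day shift in the example is to move a 1998 date forward by a whole number of weeks, which keeps the day of the week fixed. The sketch below assumes a shift of exactly 104 weeks; this matches the Sept 8 1998 to Sept 5 2000 example, but whether it is the rule actually used is an assumption. The function name is hypothetical.

```python
from datetime import date, timedelta


def shift_1998_to_2000(d):
    """Shift a 1998 date forward by 104 weeks, preserving the weekday.

    Note: the first two days of January 1998 land in late December 1999,
    so they would need separate handling.
    """
    return d + timedelta(weeks=104)


source_day = date(1998, 9, 8)
donor_day = shift_1998_to_2000(source_day)
same_weekday = source_day.weekday() == donor_day.weekday()
```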

Summary

• statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
  – focus on how anomalies differ from normal data
  – need domain knowledge to understand causation

Current Progress/Future Work

• integrate algorithms into data quality assessment program == automation
  – physical constraints
  – distribution constraints
  – other statistical characteristics of data
  – clustering
  – regression, neural networks

• will support infrastructure-related data collection activities

• use algorithms to identify and “clean” anomalies

Acknowledgements

• Minnesota Department of Transportation, especially Maggi Chalkline

• based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380
