15
PRELIMINARY REPORT FOR TASK 2 (Identify corridors of interest and acquire data and simulation models) to THE FLORIDA DEPARTMENT OF TRANSPORTATION TRAFFIC ENGINEERING AND OPERATIONS OFFICE On Project “Traffic-event Unification System Highlighting Arterial Roads” BDV31 TWO 977-97 Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida

Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

PRELIMINARY REPORT FOR TASK 2 (Identify corridors of interest and acquire data and simulation

models)

to

THE FLORIDA DEPARTMENT OF TRANSPORTATION TRAFFIC ENGINEERING AND OPERATIONS OFFICE

On Project “Traffic-event Unification System Highlighting Arterial

Roads”

BDV31 TWO 977-97

Sanjay Ranka Professor

Computer and Information Science and Engineering University of Florida

Page 2: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

1

DISCLAIMER

The opinions, findings, and conclusions expressed in this publication are those of the authors and not necessarily those of the State of Florida Department of Transportation.

Page 3: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

2

TABLE OF CONTENTS

TABLE OF CONTENTS 2

1. Introduction 3

2. Objectives 5

3. Datasets 6

3.1 ATSPM Data 6

3.2 HERE.com Data 8

4. Data Processing and Visualization 9

4.1 Libraries used 9

4.2 Data Re-sampling and Processing 9

4.3 Linking ATSPM and HERE.com Datasets via Bi-directional Mapping 10

4.4 Data Visualization 11

4.5 Visualization Tools 12

5. Conclusions 13

References 14

Page 4: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

3

1. Introduction

This report presents the supporting tasks and deliverables for Task 2 of the project. Task 2: Identify corridors of interest and acquire data and simulation models Data will be required to be stored in a database running machine learning simulations and understanding the type and frequency of missing information. Traffic data are full of noise due to various known and unknown factors. Cleaning and verifying the data for preparing for analytics is difficult and expensive task. Structured data (for example GPS latitude and longitude data) must be parsed and any missing data or inaccurate data has to be addressed. Unstructured data (for example, tweets, emails, Facebook postings, emails, video streams) may require parsing, natural language processing, interpretation and categorization (e.g. measuring sentiment). We will acquire the following data, clean it as necessary and develop a unified database that can be used to build machine learning solutions

Existing models developed for the Central Florida region Roadway geometry for the corridors of interest Demand profiles (volumes, turning movements, etc) for different “base conditions” (time

of day periods, days of the week, and seasons/special events) Signal timing parameters (green times, phasing, offset times, etc) for the same “base

conditions” as before Probe data from Bluetooth and other readers

The deliverables defined for this task were as follows:

Task 2 Deliverable: Upon completion of Task 2, the University will submit to the Research Center at [email protected] the following:

a written Technical Memorandum that will discuss the task in detail and include the following content:

o The problem or objective of the task o A discussion on the approach to solving the problem/task o The solution: datasets and streams used, tools, methodologies,

technologies, staffing and labor effort, involved in the analysis and solution o The results of the attempt to solve the problem or accomplish the objective o Any lessons learned or suggestions

a set of models and other resources acquired for this task The specific subtasks are outlined as follows along with the section numbers in the report that pertain to these subtasks.

o The problem or objective of the task (Section 2)

Page 5: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

4

o A discussion on the approach to solving the problem/task (Section 4) o The solution: datasets and streams used, tools, methodologies, technologies,

staffing and labor effort, involved in the analysis and solution (Section 3 and 4) o The results of the attempt to solve the problem or accomplish the objective

(Section 4) o Any lessons learned or suggestions (Section 4)

Page 6: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

5

2. Objectives

Intelligent transportation systems require data fusion and automatic incident detection as crucial components of their infrastructure: they are especially important in reducing congestion on freeways and urban arterial networks. The problem of integration of different information in a corridor is plagued by incompleteness, uncertainty, noise, rumors, latency and lack of relevance. Traffic accidents and related incidents cause economic loss (in addition to inconvenience and congestion). A system that can detect incidents in real time is a desirable way to reduce the impact of these events. The main objective of this task was to determine the appropriate data, acquire it and develop methods for preprocessing it. Based on discussion with the project managers, we decided to focus on two datasets

Here.com ATSPM Data

The choice was based on three criteria: availability of large amount of data, high frequency of update for potential use in real-time, and their potential effectiveness in predicting incidents. Traffic data are full of noise due to various known and unknown factors. Cleaning and verifying the data was performed and stored in a database running machine learning simulations.

Page 7: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

6

3. Datasets

In this section, we describe the two datasets we have been provided. These are fairly large datasets and provide traffic information for Seminole County, Greater Orlando Metropolitan Area.

3.1 ATSPM Data

Automated Traffic Signal Performance Measures (ATSPM)[1] data is a stream of data obtained by modern traffic intersection signal controllers. Induction loop detectors attached to the intersection collect data at 10 Hz, indicating whether a vehicle passed over it or not. Signal behavior is also captured. This data allows traffic engineers to analyze the performance of traffic intersections and improve safety and efficiency while cutting costs and congestion. The dataset contains data for 329 signal intersections in Seminole County, Greater Orlando Metropolitan Area. The data ranges for dates between 1st August 2018 to 31 October 2018. The total storage space occupied is ~ 220 GB and consists of several comma-separated value files. The data consists of 3 types of files:

● Raw Data Files: 22 such comma-separated value (.csv) files are available. Each is between 10-17 GB each and contains the data recorded at Deci second frequency. This raw data is stored in MySQL database. This raw data table has 3 months of raw data which accounts to ~200GB. This table is indexed by timestamp for fast retrieval. The data consists of 4 columns:

○ SignalID ○ Time of recording ○ EventCode: What event at the signal was captured ○ EventParam: What was the value of the event or attribute at that timestamp

Figure1: Raw Data File

Page 8: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

7

● Data Logging Requirements File: This file contains event code & parameter describe what each numeric value means in a table format (Figure 2).

Figure 2: Data Logging Requirements File

● ATSPM Additional Tables: This contains the intersection metadata (Figure 3). All the

additional files are joined into a single table. It gives compact representation of all the metadata.

○ “Approaches” Sheet describes the various approaches per signal. ○ “Signals” sheet gives physical coordinates of the signal. ○ “Detectors” sheet describes individual detectors, and which signals they are

associated with.

Figure 3: ATSPM additional tables after performing the join

Page 9: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

8

3.2 HERE.com Data

HERE.com [2] is a company that provides mapping and location services for road traffic based in Europe. HERE.com captures both static content such as road networks etc. as well as dynamic content such as traffic flows. HERE.com collects average speed of a part of the road network (called TMC, based on OpenStreetMap designations of road portions) by using streaming data from vehicles using HERE.com GPS and routing services. By aggregating this data from various vehicles, it is able to calculate an average speed for that link at a minute-by-minute resolution. HERE.com data set consists of data for Seminole County, Greater Orlando Metropolitan Area and tracks 1009 TMCs. Data since 2016 is available and can be queried for, by a web API. The dataset consists of:

● TMC Identification File: This file has details about the various TMCs including their OpenStreetMap identification code, their start and end positions (Figure 4). This file helps in identifying the location of the TMC.

Figure 4: TMC Identification File

● Seminole County Data File: This file contains the actual data. For every TMC, at a

minute resolution, the average recorded speed is reported. The reference speed is the desired speed that should be expected at the link (and not the maximum speed). Travel time indicates how long the vehicles actually took to cross the link. Confidence indicates the quality of data collection.

Figure 5: Seminole County Data File

Page 10: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

9

4. Data Processing and Visualization

Given the vast amount of data available, managing and processing it was a significant task. In this section we broadly discuss various sub-tasks that went into processing the datasets. Also, in order to get a “big picture” and make intuitive sense of the data, several visualization modalities were employed.

4.1 Libraries used

• NumPy [3] - Python package used for scientific computing and array processing • Pandas [4] - “Python Data Analysis Library”, pandas is an open source, BSD-licensed

library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

• MySQL [5] - Fast, easy-to-use relational database management system (RDBMS) • Python Multiprocessing [6] - spawning processes using an API similar to the threading

module. Pool object allows a simple means of parallelizing the execution of a function across multiple input values (data parallelism).

4.2 Data Re-sampling and Processing

The ATSPM dataset and HERE.com dataset contain data at 1 deci-second and 1-minute resolutions respectively. This leads to a very large dataset. While the technology allowed for the collection and storage of such high-resolution data, such resolution is not required for our project. Hence it is important to re-sample the data at useful timescales. Python multi-processing packages and Pandas Python data processing library are used to process data for several intersections simultaneously in a multi-core machine. Time requirement is approximately 5 min. for one week of data, 300 intersections on a machine with 54 cores, 256GB memory. The datasets were parsed, and any corrupted values were removed or suitably handled. The ATSPM dataset was re-sampled at fixed buckets or based on cycle length. For 8am to 8pm, the intervals correspond to a cycle or a fixed size interval. For 8pm to 8 am, the interval size is fixed to median of cycle lengths from 8am to 8pm (because cycle lengths during that period are significantly large and erratic) . The HERE.com dataset was re-sampled at 1min/2min buckets. For ATSPM data, controller log data is used to construct arrival volumes time series aggregated over each cycle. Flow counts per lane were calculated, for every approach direction of the intersection. Along with arrival volumes, time series of cycle lengths for every approach direction of the intersection is calculated. Data is stored for intervals that are of fixed duration or based on cycles.

Page 11: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

10

For HERE.com data, travel times, average speed over 2-minute bins is calculated. Congestion ratio is also calculated, which is the ratio of recorded speed to the reference speed for a link. A link with congestion ratio of 0.70 or below was considered to be congested. The processed data is written out and stored on the MySQL database for further use. We use this time series data to label the data to find potential incident intervals.

4.3 Linking ATSPM and HERE.com Datasets via Bi-directional Mapping

The key difference between the two datasets is that ATSPM dataset captures readings from the point of view of an intersection, whereas the HERE.com dataset captures readings from the point of view of uni directional road segments. Both use different naming and ordering schemes. However, both datasets contain geolocation coordinates (Latitude and Longitude): ATSPM data contains the Lat-Lon for the center point of the intersection whereas the HERE.com data contains Start and End Lat-Lon coordinates (with traffic flowing from Start to End). We used this fact to link the two datasets together. We start by filtering out road segments that are too small (E.g. below .25 miles) in the HERE.com data. Then for the remaining, we find that intersection coordinates that are the closest to the Start points. Similarly, we find intersection coordinates that are the closest to the End points. We also keep a track of the distance errors, as the Start and End points of the segments may not lie directly on the intersections. We then remove all segments that have a large error (E.g. above .2 miles). In this manner, we have an approximate but usable linking of datasets (Figure 6).

Figure 6: HERE.COM to ATSPM Mapping

Page 12: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

11

Figure 7: ATSPM Intersections and HERE road links combined

Using this approach, we can obtain an adjacency matrix for the intersections, with the names of the road links they are connected to each other (Figure 7). We apply the same algorithms for road links. Further we can also build a drivable adjacency list, in which we determine which road links could be a part of a coherent vehicle path, without an U-turns.

4.4 Data Visualization

We visualize the parsed data by using Matplotlib [7] Python library. Matplotlib is a Python 2D plotting library that supports a variety of hardcopy formats and interactive environments.

Figure 8: Arrival volumes aggregated per cycle for one week (Detector 139005)

Figure 9: Cycle Length over one week (Detector 139005)

Page 13: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

12

Figure 10: Arrival volumes aggregated over cycle for one week for multiple detectors on the

same approach Aggregated vehicle count from ATSPM data is visualized as time series shown below (Figure 8,9 and 10).

4.5 Visualization Tools

We developed a plotting tool which plots all the arrivals on one detector (or a subset of detectors) for an intersection for a given time interval. This tool helps us to visualize arrival volumes before and after a potential incident point (Figure 11).

Figure 11: Box plot of arrivals on phase 2 of Signal 1490 from 19:20 to 19:40 on 2018-08-21

p g

Page 14: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

13

5. Conclusions

In this report, we have described the ATSPM and HERE.com datasets. Next, we described how we parsed it and time series of arrival volumes, cycle lengths per lane for each approach of an intersection is constructed. For HERE.com, average speed, travel times time series is constructed. We described how we performed a join of ATSPM andHERE.com datasets. Time series visualizations is made using Matplotlib. A plotting tool is developed for visualization and debugging. We store this processed data in a MySQL database and will use it for developing machine learning algorithms for anomaly detection.

Page 15: Tushar Report 2 · Seminole County Data File: This file contains the actual data. For every TMC, at a minute resolution, the average recorded speed is reported. The reference speed

14

References

[1] “Automated Traffic Signal Performance Measures (ATSPMs).” U.S. Department of Transportation/Federal Highway Administration, www.fhwa.dot.gov/innovation/everydaycounts/edc_4/atspm.cfm. [2] “HERE Technologies.” HERE, HERE.com [3] NumPy is the fundamental package for scientific computing with Python www.numpy.org. [4] “Python Data Analysis Library.” Pandas, pandas.pydata.org/ [5] MySQL is an open source relational database management system https://www.mysql.com/. [6] “Python Multiprocessing” Process based threading interface https://docs.python.org/2/library/multiprocessing.html. [7] “Matplotlib” a plotting library for the Python programming language and its numerical mathematical extension NumPy https://matplotlib.org/.