
Driving School II: Video Games for Autonomous Driving

Independent Work

Artur Filipowicz
ORFE Class of 2017

Advisor: Professor Alain Kornhauser ([email protected])

May 3, 2016

Revised August 27, 2016


Abstract

We present a method for generating datasets to train neural networks and other statistical models to drive vehicles. In [8], Chen et al. used a racing simulator called Torcs to generate a dataset of driving scenes which they then used to train a neural network. One limitation of Torcs is a lack of realism. The graphics are plain and the only roadways are racetracks, which means there are no intersections, pedestrian crossings, etc. In this paper we employ a game called Grand Theft Auto 5 (GTA 5). This game features realistic graphics and a complex transportation system of roads, highways, ramps, intersections, traffic, pedestrians, railroad crossings, and tunnels. Unlike Torcs, GTA 5 has more car models; urban, suburban, and rural environments; and control over weather and time. With the control of time and weather, GTA 5 also has an edge over conventional methods of collecting datasets.

We present methods for extracting three particular features. We create a function for generating bounding boxes around cars, pedestrians, and traffic signs. We also present a method for generating pixel maps for objects in GTA 5. Lastly, we develop a way to compute distances to lane markings and the other indicators from [8].


Acknowledgments

I would like to thank Professor Alain L. Kornhauser for his mentorship during this project and Daniel Stanley and Bill Zhang for their help over the summer and last semester.

This paper represents my own work in accordance with University regulations.

Artur Filipowicz


Contents

1 From The Driving Task to Machine Learning
  1.1 The Driving Task
  1.2 The World Model
  1.3 Computer Vision
  1.4 Machine Learning

2 Datasets for the Driving Task
  2.1 Cars, Pedestrians, and Cyclists
  2.2 Lanes
  2.3 Observations on Current Datasets
  2.4 Video Games and Datasets

3 Sampling from GTA 5
  3.1 GTA 5 Scripts Development
  3.2 Test Car
  3.3 Desired Functions
  3.4 Screenshots
  3.5 Bounding Boxes
    3.5.1 GTA 5 Camera Model
    3.5.2 From 3D to 2D
    3.5.3 General Approach To Annotation of Objects
    3.5.4 Cars
    3.5.5 Pedestrians
    3.5.6 Signs
  3.6 Pixel Maps
  3.7 Road Lanes
    3.7.1 Notes on Drivers
    3.7.2 Indicators
    3.7.3 Road Network in GTA 5
    3.7.4 Finding the Lanes

4 Towards The Ultimate AI Machine
  4.1 Future Research Goals

A Screenshot Function

List of Figures

1 Graphics and roads in Torcs.
2 Graphics and roads in GTA 5.
3 Test Vehicle
4 The red dot represents camera location.
5 Camera model and parameters in GTA 5
6 Two cars bounded in boxes. Weather: rain.
7 Two cars bounded in boxes.
8 Traffic jam bounded in boxes.
9 Pedestrians bounded in boxes.
10 Some of the traffic signs present in GTA 5.
11 Stop sign in bounding box.
12 Traffic lights in bounding boxes.
13 Image with a bounding box.
14 Image with a pixel map for a car applied.
15 List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].
16 Flags for nodes. [2]
17 Flags for links. [2]
18 Blue line represents where we want to collect data on lane location.
19 Red markers represent locations of vehicle nodes.
20 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.
21 Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.
22 Node database entry design.
23 GTA V Experimental Setup

1 From The Driving Task to Machine Learning

1.1 The Driving Task

The driving task is a physics problem of moving an object from point a ∈ R^4 to b ∈ R^4, with time being the fourth dimension, without colliding with any other object. There are also additional constraints in the form of lane markings, speed limits, and traffic flow directions. Even with all constraints beyond avoiding collisions, the physical problem of finding a navigable path is easy given a model of the world. That is, if the locations and shapes of all objects are known with certainty and the locations of the constraints are known, then the task reduces to first computing a path in a digraph G representing the road network and then, for each edge, finding unoccupied space and moving the object into it. All of these problems can be solved using fundamental physics and computer science. What makes the driving task difficult in the real-world setting is the lack of an accurate world model. In reality we do not have omniscient drivers.

1.2 The World Model

People drive, and so do computers to a limited extent. Therefore, omniscience is not necessary. Some subset of the total world model is good enough to perform the driving task. Perhaps with limited knowledge it is only possible to successfully complete the task with probability less than 1, but the success rate is high enough for people to utilize this form of transport.

To drive, we still need a world model. This model is constructed by means of sensor fusion, the combination of information from several different sensors. In 2005, Princeton University's entry in the DARPA Grand Challenge, Prospect Eleven, used radar and cameras to identify and locate obstacles. Based on these measurements and GPS data, the on-board computer would create a world model and find a safe path. [4] In a similar approach, the Google Car uses radar and lidar to map the world around it. [14]

Approaches in [4], [14], and [29] appear rather cumbersome and convoluted compared to the human way of creating a world model. Humans have 5 sensors: the eyes, the nose, the ears, the mouth, and the skin. In driving, neither taste nor smell nor touch is used to build the world model, as all of these senses are mostly cut off from the world outside the vehicle. The driver can hear noises from the outside; however, they can be muffled by the sound of the driver's own vehicle, and many important objects, such as street signs and lane markings, do not make noise. To construct the world model, humans predominantly use one sensor: the eyes. We can suspect that there is enough information encoded in the visible light coming through the front windshield to build a world model good enough for completing the driving task. However, research on autonomous vehicles - the construction of solutions to the driving task using artificial intelligence - stays away from approaching the problem in the pure vision way, as noted in [4] and [29]. The reason for this is that vision, computer vision in particular, is difficult.

1.3 Computer Vision

Let X ∈ R^{h×w×c} be an image of width w, height h, and c colors. As we stated earlier, X has enough information for a human to figure out where lane markings and other vehicles are, to identify and classify road signs, and to perform other measurements to build a world model. Perhaps several images in a sequence are necessary, although [8] shows that a lot of information can be extracted from a single image. The difficulty of computer vision is that X is a matrix of numbers representing the colors of pixels. In this representation an object can appear very different depending on lighting conditions. Additionally, due to perspective, the same object can appear in different sizes and therefore occupy a different number of pixels. These are two of many variations which humans can account for, but on which naive machine approaches fail.

Computer vision is difficult but not impossible. In recent decades, researchers have used machine learning to enable computers to take X and construct more salient representations.

1.4 Machine Learning

The learning task is as follows: given some image X_i, we wish to predict a vector of indicators Y_i. Y_i could be distances to lane markings and vehicles, locations of street signs, etc., and can then be used to construct a world model. To that end, we want to train a function f such that Y_i = f(X_i). We say that (X_i, Y_i) ∼ P_{X,Y}.

The machine learning approach to this problem mimics humans in more than just the focus on visual information. The primary method of learning from images is the use of neural networks, more specifically convolutional neural networks. These statistical models are inspired by the neurons which make up the nerves and brain areas responsible for vision. The mathematical abstraction is represented as follows:

Let f(x_i, W) be a neural network of L hidden layers. The sizes of these layers are l_1 to l_L.

f(x_i, W) = g_L(W^{(L)} \cdots g_3(W^{(3)} g_2(W^{(2)} g_1(W^{(1)} x_i))) \cdots)

W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}

W^{(i)} \in \mathbb{R}^{l_{i+1} \times l_i}

g_i(x) is some activation function.


The process of training becomes the process of adjusting the values of W^{(i)}. This first requires some loss function which expresses the error made by the network. A common loss function is the L_2 loss. We wish to create a neural network model f such that

\min_{f} L_2(\mathcal{T}, f)

where D is a dataset of n indicator Y_i and image X_i pairs,

\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{n},

R is the training set, and T is the test set:

\mathcal{R} \subset \mathcal{D}, \quad \mathcal{T} \subset \mathcal{D}, \quad \mathcal{R} \cap \mathcal{T} = \emptyset, \quad \mathcal{R} \cup \mathcal{T} = \mathcal{D}, \quad |\mathcal{R}| = r, \quad |\mathcal{T}| = t.
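One common convention for the L_2 loss, which we adopt here as an assumption since the text leaves it implicit, is the sum of squared prediction errors over the set in question:

L_2(\mathcal{T}, f) = \sum_{(X_i, Y_i) \in \mathcal{T}} \|Y_i - f(X_i)\|_2^2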

To minimize the loss function with respect to W, the most common method is the Back-Propagation Algorithm [27]. The Back-Propagation Algorithm uses stochastic gradient descent to find a local minimum of a function. At each iteration j of J, the Back-Propagation Algorithm updates W:

W_{j+1} = W_j - \eta \frac{\partial E(W)}{\partial W_j}

The two sources of randomness in the algorithm are W_0 and the order π in which training examples are used. The initial values of the elements of the matrices in W_0 are uniform random variables. The ordering of examples is also often a random sample, with replacement, of J pairs (x_i, y_i) ∈ R.

On an intuitive level, the network adjusts W to extract useful features from the image pixel values X_i. In the process it models the distribution (X_i, Y_i) ∼ P_{X,Y}. In theory, the larger W is, the more capacity the network has for extracting and learning features and representing complex distributions. At the same time, it is also more likely to fit noise in the data and nonsalient features such as clouds. This overfitting causes poor generalization, and we need a network which can generalize to many driving scenes. There are several regularization techniques to overcome overfitting, including L1, L2, dropout, and others. However, these will only be effective if the data adequately represents the domain of P_{X,Y}. This domain for driving scenes is huge, considering it includes images of all the different kinds of roads, vehicles, pedestrians, street signs, traffic lights, intersections, ramps, lane markings, lighting conditions, weather conditions, times of day, and positions of the camera. [8] tested a network in a limited subset of these conditions and used almost half a million images for training.


2 Datasets for the Driving Task

Machine learning for autonomous vehicles has been studied for years, so several datasets already exist. These datasets come in two types. There are datasets of objects of interest in driving scenes, which include vehicles (cars, vans, trucks), cyclists, pedestrians, traffic lights, lane markings, and street signs. Usually, these datasets provide coordinates of bounding boxes around the objects as Y_i. These are useful for training localization and classification models. The second type of dataset provides distances to lane markings, cars, and pedestrians. These are used to train regression models. Here we give a brief overview of several of these datasets.

2.1 Cars, Pedestrians, and Cyclists

The Daimler Pedestrian Segmentation Benchmark Dataset contains 785 images of pedestrians in an urban environment captured by a calibrated stereo camera. The groundtruth consists of the true pixel shape and a disparity map. [11]

The CBCL StreetScenes Challenge Framework contains 3,547 images of driving scenes with bounding boxes for 5,799 cars, 1,449 pedestrians, and 209 cyclists, as well as buildings, roads, sidewalks, stores, trees, and the sky. The images were captured by photographers from street, crosswalk, and sidewalk views. [7]

The KITTI Object Detection Evaluation 2012 contains 7,481 training images and 7,518 test images, with each image containing several objects. The total number of objects is 80,256, including cars, pedestrians, and cyclists. The groundtruth includes a bounding box for each object as well as an estimate of its orientation in the bird's eye view. [13]

The Caltech Pedestrian Detection Benchmark contains 10 hours of driving in an urban environment. The groundtruth contains 350,000 bounding boxes for 2,300 unique pedestrians. [9]

There are several datasets for street signs [23], [21], [15]. However, these datasets were made in European countries and therefore contain European signs, which are very different from their US counterparts. Luckily, [24] is a dataset of 6,610 images containing 47 different US road signs. For each sign the annotation includes sign type, position, size, occluded (yes/no), and on side road (yes/no).

2.2 Lanes

The KITTI Road/Lane Detection Evaluation 2013 has 289 training and 290 test images of road lanes, with groundtruth consisting of pixel maps of the road area and of the lane the vehicle is in. The dataset contains images from three environments: urban with unmarked lanes, urban with marked lanes, and urban with multiple marked lanes. [12]

The ROMA lane database has 116 images of different roads with groundtruth pixel positions of visible lane markings. The camera calibration specifies the pixel distance to the true horizon and conversions between pixel distances and meters. [30]

2.3 Observations on Current Datasets

The above datasets are quite limited. First, most of them are small when compared to the half a million images used in [8]. Second, they do not represent many driving conditions, such as different weather conditions or times of day; the reason for this is that measuring equipment, especially cameras, can only function in certain conditions. Since this tends to mean sunny weather, most of these datasets are collected during such times. Additionally, all of these datasets include some amount of manual labeling, which is not feasible when the dataset includes millions of images.

2.4 Video Games and Datasets

The problems associated with these datasets would be resolved if we could somehow sample both X_i and Y_i from P_{X,Y} without having to spend time measuring Y_i. This is not possible in the real world. However, [8] decided to use a virtual world, a racing video game called Torcs [6]. The hope behind this approach is that the game can simulate P_{X,Y} well enough that the network, once trained, will be able to generalize to the real world. Let us assume that this is true.

The main benefit of using Torcs and other video games is access to the game engine. This allows us to extract the true Y_i for each X_i we harvest from the screen. Torcs itself has several restrictions which keep it from simulating the range of driving conditions present in the real world. Fundamentally, it is a racing game with circular, one-way tracks. The weather and lighting conditions are fixed. The textures are rather simple and thus unrealistic.

To overcome these limitations and allow for a more diverse and realistic dataset, we focus on the game Grand Theft Auto 5 (GTA 5). Unlike the makers of Torcs, the makers of GTA 5 had the funds to create a very realistic world, since they were developing a commercial product and not an open-source research tool. GTA 5 has hundreds of different vehicles, pedestrians, freeways, intersections, traffic signs, traffic lights, rich textures, and many other elements which create a realistic environment. Additionally, GTA 5 has about 14 weather conditions and simulates lighting conditions for all 24 hours of the day. To tap into these features, the next section examines ways of extracting various data.


Figure 1: Graphics and roads in Torcs.

Figure 2: Graphics and roads in GTA 5.

3 Sampling from GTA 5

3.1 GTA 5 Scripts Development

GTA 5 is a closed-source game. There is no out-of-the-box access to the underlying game engine. However, due to the game's popularity, fans have hacked into it and developed a library of functions for interacting with the game engine. This is done by the use of scripts loaded into the game. The objective of this paper is not to give a tutorial on coding scripts for GTA 5, and as such we will keep the discussion of code to a minimum. However, we will explain some of the code and game dynamics for the purpose of reproducibility and presentation of the methods used to extract data.

Two tools are needed to write scripts for GTA 5. The first tool is Script Hook V by Alexander Blade. This tool can be downloaded from https://www.gta5-mods.com/tools/script-hook-v or http://www.dev-c.com/gtav/scripthookv/. It comes with a useful trainer which provides basic control over many game variables, including weather and time. The next tool is a library called Script Hook V .NET by Patrick Mours, which allows us to use C# and other .NET languages to write scripts for GTA 5. The library can be downloaded from https://www.gta5-mods.com/tools/scripthookv-net. For the full source code and list of functions, please see https://github.com/crosire/scripthookvdotnet.

3.2 Test Car

To make the data collection more realistic, we use an in-game vehicle, the test car, with a mounted camera, similar to [13]. The vehicle model for the test car was picked arbitrarily and can be replaced with any other model. Besides the steering controls, we introduce 3 new functions bound to the following keys: NumPad0, "I", and "O". NumPad0 spawns a new instance of our test car. "I" mounts the rendering camera on the test car.


Figure 3: Test Vehicle

"O" restores control of the rendering camera back to its original state. Let us look at some of the code for the test car.

The TestVehicle() function is the constructor for the TestVehicle class. It is called once when all of the scripts are loaded. This occurs at the start of the game and can be triggered at any point in the game by hitting the "Insert" key. This constructor gains control of the camera which is rendering the game by destroying all cameras and creating a new rendering camera. The function responsible for this is World.CreateCamera. The first two arguments represent position and rotation. The last argument is the field of view in degrees. We set it to 50; however, this could be changed to fit the parameters of a real-world camera.

It is important to note GTA.Native.Function.Call. GTA 5's game engine has thousands of native functions which were used by the developers to build the game. The library encapsulates some of them. Others can be called using GTA.Native.Function.Call, where the first argument is the hash code of the function to call and the remaining arguments are the arguments to pass to the native function. One of the biggest challenges in this project is figuring out what these other arguments represent and control. There are online databases where players of the game list known functions and parameters. These databases are far from complete. Therefore, for some of these native function calls, some of the arguments may have no justification besides the fact that they make the function work. This is the price paid for using a closed-source game.

public TestVehicle()
{
    UI.Notify("Loaded TestVehicle.cs");
    // create a new camera
    World.DestroyAllCameras();
    camera = World.CreateCamera(new Vector3(), new Vector3(), 50);
    camera.IsActive = true;
    GTA.Native.Function.Call(Hash.RENDER_SCRIPT_CAMS, false, true,
        camera.Handle, true, true);
    // attach time methods
    Tick += OnTick;
    KeyUp += onKeyUp;
}

The camera position and rotation do not matter in the previous function, as they will be dynamically updated to keep up with the position and rotation of the car. This is accomplished by updating both properties at every tick of the game. A tick is a periodic call of the OnTick function. On each tick, we keep the camera following the car by setting its rotation and position to be those of the test car. The position of the camera is offset by 2 meters forward and 0.4 meters up relative to the center of the test car. This places the camera at the center of the hood of the car, as seen in Figure 4.

// Function used to keep camera on vehicle and facing forward on each tick step.
public void keepCameraOnVehicle()
{
    if (Game.Player.Character.IsInVehicle())
    {
        // keep the camera in the same position relative to the car
        camera.AttachTo(Game.Player.Character.CurrentVehicle,
            new Vector3(0f, 2f, 0.4f));
        // rotate the camera to face the same direction as the car
        camera.Rotation = Game.Player.Character.CurrentVehicle.Rotation;
    }
}


Figure 4: The red dot represents camera location.

void OnTick(object sender, EventArgs e)
{
    keepCameraOnVehicle();
}

3.3 Desired Functions

Being inside the game with our test vehicle, we want to collect training data. Existing datasets provide good inspiration for what should be collected. A common datum is the coordinates of bounding boxes for objects such as cars, as in [7], [9], and [13], and traffic signs, as in [23], [21], [15], and [24]. Pixel maps, representing the areas in the image where certain objects appear, are also common. ROMA [30] has pixels of lane markings marked. The KITTI Road/Lane Detection Evaluation 2013 [12] has pixels of road areas marked. The Daimler Pedestrian Segmentation Benchmark Dataset [11] has pixels of pedestrians marked. Lastly, we would like to make measurements of distances to lanes and cars in the framework of [8]. The following sections describe ways of collecting the above information for X, Y data pairs.


3.4 Screenshots

To collect X, we take a screenshot of the game. GTA 5 runs only on Windows. Using the Windows user32.dll functions GetForegroundWindow, GetClientRect, and ClientToScreen, we can extract the exact area of the screen where the game appears. Since neural networks take small images as input, usually around 100 pixels by 200 pixels, we set the game resolution to be as small as possible and let h = IMAGE_HEIGHT = 600 pixels and w = IMAGE_WIDTH = 800 pixels. These could be further scaled down to fit a particular model such as [8]. For the implementation, please see Appendix A.

3.5 Bounding Boxes

A bounding box is a pair of points which defines a rectangle encompassing an object in an image. Let b = {(x_min, y_min), (x_max, y_max)} be a bounding box, where x_min, y_min, x_max, y_max are coordinates in an image in pixels, with the upper left corner being the origin. The task of creating a bounding box consists of computing the extremes of a 3-dimensional object and enclosing them in a rectangle. The algorithm for doing this is very simple.

Algorithm 1 Algorithm for computing a bounding box.

Require: model m and center c of an object
1: get the dimensions of m → (h, w, d)
2: compute unit vectors with respect to the object (e_x, e_y, e_z)
3: using e_x, e_y, e_z and h, w, d, compute the set of vertices v of a cube enclosing the object
4: map each point p ∈ v to the viewing plane using g : R^3 → R^3 to create the set z
5: x_min = \min_{p \in z} p_x
6: x_max = \max_{p \in z} p_x
7: y_min = \min_{p \in z} p_y
8: y_max = \max_{p \in z} p_y
9: if x_min < 0 then x_min = 0
10: if x_max > IMAGE_WIDTH then x_max = IMAGE_WIDTH
11: if y_min < 0 then y_min = 0
12: if y_max > IMAGE_HEIGHT then y_max = IMAGE_HEIGHT

In GTA 5 it is very easy to compute e_x, e_y, e_z and to get h, w, d for the models of cars, pedestrians, and traffic signs. Therefore, it is easy to create a bounding cube around an object. The code excerpt below details the calculation: e is the object we wish to bound and dim is a vector of the model dimensions h, w, d.

// FUL and BLR are the front-upper-left and back-lower-right vertices of the
// object's bounding cube; the remaining vertices are generated from them
// using the object's axes and dimensions.
Vector3[] vertices = new Vector3[8];
vertices[0] = FUL;
vertices[1] = FUL - dim.X*e.RightVector;
vertices[2] = FUL - dim.Z*e.UpVector;
vertices[3] = FUL - dim.Y*Vector3.Cross(e.UpVector, e.RightVector);
vertices[4] = BLR;
vertices[5] = BLR + dim.X*e.RightVector;
vertices[6] = BLR + dim.Z*e.UpVector;
vertices[7] = BLR + dim.Y*Vector3.Cross(e.UpVector, e.RightVector);

There is a function called WorldToScreen which takes a 3-dimensional point in the world and computes that point's location on the screen. Unfortunately, this function returns the origin if a point is not visible on the screen. This is a problem, as we want to draw a bounding box even if part of the object is out of view, for example a car coming in on the left. In these cases we want the bounding box to extend to the edge of the screen. The simplest solution is to map all points to the viewing plane, which is infinite, and follow the algorithm above. This requires a custom g function and a good understanding of the camera model.
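As a sketch, the rest of Algorithm 1 can then be written as below, where a hypothetical helper Get2DFrom3D stands in for g and is developed in Section 3.5.2 (the helper name and clamping code are ours, not part of the game's API):

// Project all 8 vertices through g and take the clamped extremes.
float xmin = float.MaxValue, ymin = float.MaxValue;
float xmax = float.MinValue, ymax = float.MinValue;
foreach (Vector3 vertex in vertices)
{
    Vector2 s = Get2DFrom3D(vertex); // hypothetical helper; see Section 3.5.2
    xmin = Math.Min(xmin, s.X); xmax = Math.Max(xmax, s.X);
    ymin = Math.Min(ymin, s.Y); ymax = Math.Max(ymax, s.Y);
}
// Clamp to the screen so partially visible objects get boxes at the edges.
xmin = Math.Max(0f, xmin); ymin = Math.Max(0f, ymin);
xmax = Math.Min(IMAGE_WIDTH, xmax); ymax = Math.Min(IMAGE_HEIGHT, ymax);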

3.5.1 GTA 5 Camera Model

Let us first establish some terminology. Let e ∈ R^3 be the location of the observer and let c ∈ R^3 be a point on the viewing plane, the plane where the image of the world is formed, such that the vector p from e to c represents the direction the camera is pointing and is perpendicular to the viewing plane. Additionally, let θ be the rotation vector of the camera relative to the world coordinates. After a lot of experimentation, we determined that the position property of the camera object in GTA 5 refers to e. θ measures angles counterclockwise in degrees. When θ = 0, the camera is facing down the positive y-axis, and the viewing plane is thus the xz-plane. The order of rotation from this position is around the x-axis, then the y-axis, and then the z-axis.

3.5.2 From 3D to 2D

Based on this information about the camera model, we can take a 3-dimensional point in the world, map it to the viewing plane, and then transform it to screen pixels. Let a ∈ R^3 be the point we wish to map. First we must transform this point to camera coordinates. This is accomplished by rotating a using the equations below and subtracting c (the subtraction is omitted from the equations).


Figure 5: Camera model and parameters in GTA 5

\begin{pmatrix} d_x \\ d_y \\ d_z \end{pmatrix} =
\begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix}
\begin{pmatrix} a_x \\ a_y \\ a_z \end{pmatrix}

d_x = \cos\theta_z [a_x \cos\theta_y + \sin\theta_y (a_y \sin\theta_x + a_z \cos\theta_x)] - \sin\theta_z [a_y \cos\theta_x - a_z \sin\theta_x]

d_y = \sin\theta_z [a_x \cos\theta_y + \sin\theta_y (a_y \sin\theta_x + a_z \cos\theta_x)] + \cos\theta_z [a_y \cos\theta_x - a_z \sin\theta_x]

d_z = -a_x \sin\theta_y + \cos\theta_y (a_y \sin\theta_x + a_z \cos\theta_x)

We also need to rotate the vector representing the z direction in the world, v_{up,world}, and the vector representing the x direction in the world, v_{x,world}. We also need to compute the width and height of the region of the viewing plane which is actually displayed on screen. We call this region the view window. In the equations below, F is the field of view in radians and d_{near clip} is the distance between c and e.

viewWindowHeight = 2 \, d_{near\,clip} \tan(F/2)

viewWindowWidth = \frac{IMAGE\_WIDTH}{IMAGE\_HEIGHT} \cdot viewWindowHeight


We then compute the intersection point between the vector d - e and the viewing plane; call it p_{plane}. We translate the origin to the upper left corner of the view window and update p_{plane} to p'_{plane}.

newOrigin = c + \frac{viewWindowHeight}{2} v_{up,camera} - \frac{viewWindowWidth}{2} v_{x,camera}

p'_{plane} = (p_{plane} + c) - newOrigin

Next we calculate the coordinates of p'_{plane} in the two dimensions of the plane.

viewPlaneX = \frac{p'^{T}_{plane} v_{x,camera}}{v^{T}_{x,camera} v_{x,camera}}

viewPlaneZ = \frac{p'^{T}_{plane} v_{up,camera}}{v^{T}_{up,camera} v_{up,camera}}

Finally, we scale the coordinates to the size of the screen. UI.WIDTH and UI.HEIGHT are in-game constants.

screenX = \frac{viewPlaneX}{viewWindowWidth} \cdot UI.WIDTH

screenY = \frac{-viewPlaneZ}{viewWindowHeight} \cdot UI.HEIGHT

The process is summarized below.

Algorithm 2 get2Dfrom3D: Algorithm for computing screen coordinates of a 3D point.

Require: a
1: translate and rotate a into the camera-coordinate point d
2: rotate v_{up,world}, v_{x,world} to v_{up,camera}, v_{x,camera}
3: compute viewWindowHeight, viewWindowWidth
4: find the intersection of d - e with the viewing plane
5: translate the origin of the viewing plane
6: calculate the coordinates of the intersection point in the plane
7: scale the coordinates to screen size in pixels
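To make the construction concrete, the following is a minimal sketch of get2Dfrom3D in C#. The helper names are ours, System.Numerics types stand in for their GTA.Math equivalents so the sketch is self-contained, and the field of view and near-clip distance are assumed to be read off the camera object; this is a sketch of the construction above, not the game's API.

using System;
using System.Numerics;

static class Projection
{
    // Rotate a vector by the camera rotation (degrees, counterclockwise),
    // applying the x-, then y-, then z-axis rotations from Section 3.5.2.
    static Vector3 Rotate(Vector3 a, Vector3 thetaDeg)
    {
        double x = Math.PI / 180.0 * thetaDeg.X;
        double y = Math.PI / 180.0 * thetaDeg.Y;
        double z = Math.PI / 180.0 * thetaDeg.Z;
        double u = a.Y * Math.Sin(x) + a.Z * Math.Cos(x); // shared subterms
        double v = a.Y * Math.Cos(x) - a.Z * Math.Sin(x);
        double w = a.X * Math.Cos(y) + u * Math.Sin(y);
        return new Vector3(
            (float)(Math.Cos(z) * w - Math.Sin(z) * v),
            (float)(Math.Sin(z) * w + Math.Cos(z) * v),
            (float)(-a.X * Math.Sin(y) + u * Math.Cos(y)));
    }

    // Map a world point a to screen pixels per Algorithm 2. Returns null if
    // the ray from the camera to the point is parallel to the viewing plane.
    public static Vector2? Get2DFrom3D(Vector3 a, Vector3 e, Vector3 rotDeg,
        float fovDeg, float nearClip, float screenW, float screenH)
    {
        // Camera axes: at rotation zero the camera faces the +y axis.
        Vector3 forward = Rotate(new Vector3(0f, 1f, 0f), rotDeg);
        Vector3 up = Rotate(new Vector3(0f, 0f, 1f), rotDeg);    // v_up,camera
        Vector3 right = Rotate(new Vector3(1f, 0f, 0f), rotDeg); // v_x,camera
        Vector3 c = e + nearClip * forward; // center of the viewing plane

        // View window dimensions from the field of view and aspect ratio.
        double fov = Math.PI / 180.0 * fovDeg;
        float winH = (float)(2.0 * nearClip * Math.Tan(fov / 2.0));
        float winW = screenW / screenH * winH;

        // Intersect the ray e + t(a - e) with the plane through c, normal to forward.
        float denom = Vector3.Dot(a - e, forward);
        if (Math.Abs(denom) < 1e-6f) return null;
        float t = Vector3.Dot(c - e, forward) / denom;
        Vector3 pPlane = e + t * (a - e);

        // Coordinates relative to the upper-left corner of the view window.
        Vector3 origin = c + winH / 2f * up - winW / 2f * right;
        Vector3 p = pPlane - origin;
        float viewX = Vector3.Dot(p, right); // axes are unit length, so no normalization
        float viewZ = Vector3.Dot(p, up);

        return new Vector2(viewX / winW * screenW, -viewZ / winH * screenH);
    }
}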


3.5.3 General Approach To Annotation of Objects

The main objective is to draw bounding boxes around objects which are within a certain distance. There exist functions GetNearbyVehicles, GetNearbyPeds, and GetNearbyEntities. These functions allow us to get an array of all cars, pedestrians, and objects in an area around the test car. Each object can be tested individually to see if it is visible on the screen. We created a custom function for doing so, as the in-game function has unreliable behavior. This function works by checking whether it is possible to draw a straight line between e and at least one of the vertices of the bounding cube without hitting any other object. The name of this method is ray casting, and it will be discussed in more detail later. It must be noted that in the hierarchy of the game, pedestrians and vehicles are also entities. Therefore a filtering process is applied when bounding signs. This process is discussed in the signs section.

3.5.4 Cars

Compared to Torcs, GTA 5 has almost ten times more car models. There are 259 vehicles in GTA 5 (see http://www.ign.com/wikis/gta-5/Vehicles for the complete list). These vehicles are of various shapes and sizes, from golf carts to trucks and trailers. This diversity is more representative of the real distribution of vehicles and can hopefully be utilized to train more accurate neural networks. The above method can put a bounding box around any of these vehicles. Please see Figures 6, 7, and 8 for examples.

3.5.5 Pedestrians

Pedestrians can also be bounded for classification and localization training. GTA 5 has pedestrians of various genders and ethnicities. More importantly, the pedestrians in GTA 5 perform various actions like standing, crossing streets, sitting, etc. This creates a lot of diversity for training. The drawback of GTA 5 is that all pedestrians are about the same height.

3.5.6 Signs

As mentioned before, signs are a bit more tricky to bound, for two reasons. First, the only way to find the signs which are around the test vehicle is to get all entities. This includes cars, pedestrians, and various miscellaneous props.


Figure 6: Two cars bounded in boxes. Weather: rain.

Figure 7: Two cars bounded in boxes.


Figure 8: Traffic jam bounded in boxes.

Figure 9: Pedestrians bounded in boxes.


Sign Description | DOT Id [3]
Stop Sign | R1-1
Yield Sign | R1-2
One Way Sign | R6-1
No U-Turn Sign | R3-4
Freeway Entrance | D13-3
Do Not Enter Wrong Way Sign | R5-1 and R5-1a

Figure 10: Some of the traffic signs present in GTA 5. (Each row of the original figure also includes an in-game picture of the sign.)

Many of these entities are of no interest. Thus we need to check the model of each entity to see if it is a traffic sign. To do so, we need a list of the models of all traffic signs in GTA 5. This list would include many of the signs listed in the Manual on Uniform Traffic Control Devices [3]. See Figure 10 for some of the signs in GTA 5.

The second difficulty with traffic signs is that they may require more than one bounding box. For example, a traffic light may have several lights on it; see Figure 12. This leads to the idea of spaces of interest (SOIs). One sign model may have several spaces of interest we wish to bound.


Figure 11: Stop sign in bounding box.

Figure 12: Traffic lights in bounding boxes.


There is an elegant solution to both problems: a database of spaces of interest. Every entry contains a model hash code, the name of the sign, and the x, y, z coordinates of the front upper left and back lower right vertices of the bounding cube. With such a database, the algorithm for bounding signs is as follows:

Algorithm 3 Algorithm for bounding signs.

Require: d - database of spaces of interest
1: read in d
2: get array of entities e from GetNearbyEntities
3: for each entity in e do
4:   check if the model of the entity matches any hash codes in d
5:   get all the matching spaces of interest
6:   for each space of interest do
7:     draw a bounding box
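As a sketch, the lookup could be organized as below. The SpaceOfInterest record, the DrawBoundingBox helper, and the field names are hypothetical, chosen only to mirror the entry layout just described (assumes using System.Collections.Generic, System.Linq, GTA, and GTA.Math):

// Hypothetical record mirroring one database entry.
struct SpaceOfInterest
{
    public int ModelHash;     // sign model hash code
    public string Name;       // sign name
    public Vector3 FUL, BLR;  // front-upper-left / back-lower-right vertices
}

void BoundSigns(List<SpaceOfInterest> database, Vector3 playerPos, float radius)
{
    // Index the database by model hash so each entity needs one lookup.
    var byModel = database.GroupBy(s => s.ModelHash)
                          .ToDictionary(g => g.Key, g => g.ToList());
    foreach (Entity entity in World.GetNearbyEntities(playerPos, radius))
    {
        List<SpaceOfInterest> sois;
        if (byModel.TryGetValue(entity.Model.Hash, out sois))
            foreach (var soi in sois)
                DrawBoundingBox(entity, soi.FUL, soi.BLR); // Algorithm 1 on the SOI cube
    }
}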

3.6 Pixel Maps

Pixel maps are more refined bounding boxes. Instead of marking an entity with the four coordinates of a box, we mark it with every pixel it occupies on the screen. This can be done easily when we start with a bounding box b = {(x_min, y_min), (x_max, y_max)} and invert the function which maps 3-dimensional points to the screen. The inverse of g can be constructed as follows. Given screenX and screenY in pixels, we transform the pixel values to coordinates on the viewing plane. Next, we transform the point on the viewing plane into a point in the 3-dimensional world, p_world.

viewPlaneX = \frac{screenX}{UI.WIDTH} \cdot viewWindowWidth

viewPlaneZ = \frac{-screenY}{UI.HEIGHT} \cdot viewWindowHeight

p_{world} = viewPlaneX \cdot v_{x,camera} + viewPlaneZ \cdot v_{up,camera} + newOrigin

Once we compute p_world, we use the Raycast function to get the entity which occupies that pixel. The Raycast function requires a point of origin, in our case e; a direction, in our case p_world - e; and a maximum distance the ray should travel, which we can set to a very large number like 10,000. If the entity returned by Raycast matches the entity the bounding box encloses, then we add the pixel to the map.


Algorithm 4 Algorithm for computing a pixel map of an entity.

Require: entity, b = {(x_min, y_min), (x_max, y_max)}
1: let map be a boolean array IMAGE_WIDTH by IMAGE_HEIGHT
2: for x ∈ {x_i | x_i ∈ Z, x_min ≤ x_i ≤ x_max} do
3:   for y ∈ {y_i | y_i ∈ Z, y_min ≤ y_i ≤ y_max} do
4:     compute p_world of x, y
5:     Raycast from e in the direction of p_world - e to get entityRaycast
6:     if entity = entityRaycast then
7:       set map[x, y] to true

Depending on the application, these maps can be combined together using the boolean OR function. The function for pixel maps is yet to be implemented due to time constraints. Besides being a fairly direct extension of bounding boxes, it is also less useful for machine learning due to a cumbersome and perhaps unnecessarily complex representation of objects. Figures 13 and 14 show what the result of such a function would look like.
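Were it implemented, the inner loop might look like the following sketch. PixelToWorld stands in for the inverse mapping above, and the raycast is written against the Script Hook V .NET World.Raycast wrapper, whose exact signature may differ between versions:

// map[x, y] is true where the entity covers the pixel.
bool[,] map = new bool[IMAGE_WIDTH, IMAGE_HEIGHT];
for (int x = (int)xmin; x <= (int)xmax; x++)
{
    for (int y = (int)ymin; y <= (int)ymax; y++)
    {
        // Invert g: recover the world point behind this pixel (hypothetical helper).
        Vector3 pWorld = PixelToWorld(x, y);
        // Cast a ray from the camera through the pixel, far past the scene.
        RaycastResult result = World.Raycast(e, e + 10000f * (pWorld - e),
                                             IntersectOptions.Everything);
        if (result.HitEntity == entity)
            map[x, y] = true;
    }
}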

3.7 Road Lanes

Identifying and locating cars, pedestrians, and signs only helps with part of the driving task. Even without any of these things present, drivers must still stay within a specified lane. Ultimately, locating the lanes and the vehicle's position in them is the foundation of the driving task. We will explore a method for extracting information similar to [8] from GTA 5.

3.7.1 Notes on Drivers

First, let us examine how real drivers collect information on lane positions. There is ample literature on the topic. The general consensus is that humans look ahead about 1 second to locate lanes [10] [20] [19]. This time applies for speeds between 30 km/h and 60 km/h [10] [19] and corresponds to a distance of about 10 meters. In a more detailed model, human drivers have 2 distances at which they collect information. At 0.93 s, or 15.7 m, road curvature information is collected [19], and at 0.53 s, or 9 m, position in lane is collected [19]. Near information is used to fine-tune driving and is sufficient at low speeds [19]. At high speeds, the farther level is used for guidance and stabilization [10]. Drivers also look about 5.5 degrees below the true horizon for road data [19]. For curves, humans use a tangent point on the inside of the curve for guidance [20]. They locate this point 1 to 2 seconds before entering the curve [20].


Figure 13: Image with a bounding box.


Figure 14: Image with a pixel map for a car applied.


3.7.2 Indicators

From the literature on human cognition, we know where people look for information on road lanes. In [8], we find a very useful model of what information to collect. Chen et al.'s system uses 13 indicators for navigating down a highway-like racetrack. While this roadway is very simple compared to real-world roads, which have exits, entrances, shared left-turn lanes, and lane merges, the indicators are quite universal. Figure 15 lists the indicators, their descriptions, and their ranges.

3.7.3 Road Network in GTA 5

The GTA 5 road network is composed of 74,530 nodes and 77,934 links. [2] For each node there are x, y, z coordinates and 19 flags, and each link consists of 2 node ids and 4 flags. [2] This information is contained in paths.ipl. Figures 16 and 17 show which flags are currently known. It does not appear that any of these flags would be particularly useful for figuring out the location of the lane markings.

The Federal Highway Administration sets the lane width for freeway lanes at 3.6 m (12 feet) and for local roads between 2.7 m and 3.6 m. Ramps are between 3.6 and 9 m (12 to 30 feet). [1] Based on measurements, the lanes in GTA 5 are 5.6 meters wide. This should not be a problem when the network is applied to real-world applications, since the output can always be scaled.

3.7.4 Finding the Lanes

We know what information we would like to collect, and we know that we want to collect it at a point in the road about 10 meters in front of the test car. Figure 18 represents our data collection situation. We want to compute where the lanes are at the blue line. Assuming we could locate the markings for the left, right, and middle lanes, we could then see if there are any cars whose positions fall between these points. The cars would also have to be visible on the screen and no farther than some maximum distance. Following [8], this distance d could be 70 meters.

We can compute the indicators if we know the position of the lanes and the heading of the road. Let h be the heading vector of the road at the 10 meter mark. Let LL, ML, MR, and RR be the points on the lane markings where the blue line intersects the lanes. Let f be a point on the ground at the very front of the test vehicle, possibly below the camera. We will perform the calculation for the three-lane indicators, as the two-lane indicators can be filled in with values based on them. The angle is simply the angle between the test car heading vector, h_car, and the road heading vector.


Indicator | Description | Min Value | Max Value
angle | angle between the car's heading and the tangent of the road | -0.5 | 0.5
dist_L | distance to the preceding car in the left lane | 0 | 75
dist_R | distance to the preceding car in the right lane | 0 | 75
toMarking_L | distance to the left lane marking | -7 | -2.5
toMarking_M | distance to the central lane marking | -2 | 3.5
toMarking_R | distance to the right lane marking | 2.5 | 7
dist_LL | distance to the preceding car in the left lane | 0 | 75
dist_MM | distance to the preceding car in the current lane | 0 | 75
dist_RR | distance to the preceding car in the right lane | 0 | 75
toMarking_LL | distance to the left lane marking of the left lane | -9.5 | -4
toMarking_ML | distance to the left lane marking of the current lane | -5.5 | -0.5
toMarking_MR | distance to the right lane marking of the current lane | 0.5 | 5.5
toMarking_RR | distance to the right lane marking of the right lane | 4 | 9.5

Figure 15: List of indicators, their ranges and positions. Distances are in meters, and angles are in radians. Graphic reproduced from [8].


Flag | Meaning
0 | 0 (primary) or 1 (secondary or tertiary)
1 | 0 (land), 1 (water)
2 | unknown (0 for all nodes)
3 | unknown (1 for 65,802 nodes, otherwise 0, 2, or 3)
4 | 0 (road), 2 (unknown), 10 (pedestrian), 14 (interior), 15 (stop), 16 (stop), 17 (stop), 18 (pedestrian), 19 (restricted)
5 | unknown (from 0/15 to 15/15)
6 | unknown (0 for 60,111 nodes, 1,141 other values)
7 | 0 (road) or 1 (highway or interior)
8 | 0 (primary or secondary) or 1 (tertiary)
9 | 0 (most nodes) or 1 (some tunnels)
10 | unknown (0 for all nodes)
11 | 0 (default) or 1 (stop - turn right)
12 | 0 (default) or 1 (stop - go straight)
13 | 0 (major) or 1 (minor)
14 | 0 (default) or 1 (stop - turn left)
15 | unknown (1 for 10,455 nodes, otherwise 0)
16 | unknown (1 for 32 nodes, otherwise 0, on highways)
17 | unknown (1 for 62 nodes, otherwise 0, on highways)
18 | unknown (1 for 92 nodes, otherwise 0, some turn lanes)

Figure 16: Flags for nodes. [2]

Flag | Meaning
0 | unknown (-10, -1 to 8, or 10)
1 | unknown (0 to 4 or 6)
2 | 0 (one-way), 1 (unknown), 2 (unknown), 3 (unknown)
3 | 0 (unknown), 1 (unknown), 2 (unknown), 3 (unknown), 4 (unknown), 5 (unknown), 8 (lane change), 9 (lane change), 10 (street change), 17 (street change), 18 (unknown), 19 (street change)

Figure 17: Flags for links. [2]

angle = \cos^{-1}\left(\frac{h \cdot h_{car}}{\|h\| \|h_{car}\|}\right)

For toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR, we will assume that lanes are straight lines. We have a point on each of these lines and a vector indicating the direction they are heading. This assumption is crude; however, at the distances we are discussing it should not produce large errors. Additionally, we could adjust the distance


Figure 18: Blue line represents where we want to collect data on lane location.


at which we sample data based on road heading. This would not only be more in line with human behavior [10] [20] [19], it would also reduce errors. To compute the distance, we must project the vector f - LL onto the vector -h and compute the distance between the projected point and f - LL. We will work out the mathematics for the left marking of the left lane, LL.

r = \mathrm{proj}_{-h}(f - LL) = \frac{(f - LL) \cdot (-h)}{\|-h\|^2} (-h)

toMarking\_LL = \|(f - LL) - r\|

To compute dist_LL, dist_MM, and dist_RR, we must first figure out which vehicles are in which lanes. Of all the vehicles returned by GetNearbyVehicles, we can eliminate any whose heading vector forms an angle of more than 90 degrees with the heading of the road. The position of the vehicle, p, must be within the rectangular prism formed by LL, RR, f, and f + d*h in the direction normal to the ground, which is also the up vector of the test car, v_up. This can be computed by projecting LL - f, RR - f, and d*h onto the plane of f and v_up. The following are the projections of the points.

r_{LL} = LL - \mathrm{proj}_{v_{up}}(LL) = LL - \frac{LL \cdot v_{up}}{\|v_{up}\|^2} v_{up}

r_{RR} = RR - \mathrm{proj}_{v_{up}}(RR) = RR - \frac{RR \cdot v_{up}}{\|v_{up}\|^2} v_{up}

r_{f+dh} = (f + dh) - \mathrm{proj}_{v_{up}}(f + dh) = (f + dh) - \frac{(f + dh) \cdot v_{up}}{\|v_{up}\|^2} v_{up}

r_{p} = p - \mathrm{proj}_{v_{up}}(p) = p - \frac{p \cdot v_{up}}{\|v_{up}\|^2} v_{up}

Now we just have to check that the y coordinate of r_p is between the y coordinates of r_LL and r_RR and that the x coordinate of r_p is between 0 and the x coordinate of r_{f+dh}. If the vehicle satisfies these bounds, we can compute its distance to all lane markings in the same way we did for the test vehicle. We then check which marking it is closest to and assign it to that lane, or perform additional logic. Let us assume it is in the left lane. We perform the following to compute dist_LL.

r = \mathrm{proj}_{h_{car}}(p - f) = \frac{(p - f) \cdot h_{car}}{\|h_{car}\|^2} h_{car}

dist\_LL = \|r\|
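As a sketch, these projections translate directly into code. The following C# helpers (names ours, using System.Numerics for self-containment) compute the angle, the distance to a marking, and the along-heading distance to a vehicle:

using System;
using System.Numerics;

static class LaneMath
{
    // angle between the road heading h and the car heading hcar
    public static float Angle(Vector3 h, Vector3 hcar) =>
        (float)Math.Acos(Vector3.Dot(h, hcar) / (h.Length() * hcar.Length()));

    // toMarking_*: reject the component of (f - marking) along -h
    public static float ToMarking(Vector3 f, Vector3 marking, Vector3 h)
    {
        Vector3 d = f - marking;
        Vector3 r = Vector3.Dot(d, -h) / (-h).LengthSquared() * (-h);
        return (d - r).Length();
    }

    // dist_*: length of (p - f) projected onto the test car heading
    public static float DistAhead(Vector3 p, Vector3 f, Vector3 hcar)
    {
        Vector3 r = Vector3.Dot(p - f, hcar) / hcar.LengthSquared() * hcar;
        return r.Length();
    }
}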


Algorithm 5 Algorithm for computing dist_LL, dist_MM, and dist_RR.

Require: road heading h, lane marking points LL, ML, MR, RR, and point f
1: create arrays dist_LLs, dist_MMs, and dist_RRs and add d to each
2: let l be the lane of the vehicle
3: for each vehicle v, at position p, returned by GetNearbyVehicles do
4:   if cos^{-1}((h · h_car) / (||h|| ||h_car||)) < π/2 then
5:     if p is in the three lanes, in front of the test car, and close then
6:       compute toMarking_LL, toMarking_ML, toMarking_MR, and toMarking_RR for p
7:       if toMarking_LL is smallest then
8:         l = left lane
9:       if toMarking_RR is smallest then
10:        l = right lane
11:      if toMarking_ML is smallest AND toMarking_LL < toMarking_MR then
12:        l = left lane
13:      else
14:        l = middle lane
15:      if toMarking_MR is smallest AND toMarking_RR < toMarking_MR then
16:        l = right lane
17:      else
18:        l = middle lane
19:      if l = right lane then
20:        add ||proj_{h_car}(p - f)|| to dist_RRs
21:      else if l = left lane then
22:        add ||proj_{h_car}(p - f)|| to dist_LLs
23:      else
24:        add ||proj_{h_car}(p - f)|| to dist_MMs
25: dist_RR = min dist_RRs
26: dist_LL = min dist_LLs
27: dist_MM = min dist_MMs


To perform the above computation, we need a vector representing the heading of the road and a point on each lane marking. This is where the challenge begins. We cannot use any of the functions or methods discussed for objects, because roads and lane markings are not entities. The road is part of the terrain and the lanes are a texture. Therefore, we cannot get the width of the road model or the position of a lane marking the way we obtained those properties for cars.

GTA 5 has realistic traffic. There are many AI-driven cars in the game which navigate the road network while staying in lanes. Therefore, the game engine knows the locations of the lane markings. There are several functions which pertain to roads. GetStreetName returns the name of the street at a specified point in the world. IS_POINT_ON_ROAD is a native function which checks if a point is on a road. There are also several functions which deal with vehicle nodes.

Vehicle nodes appear to be the primary way the graph of the road network is represented in the game. Every vehicle node is a point at the center of the road, as seen in Figure 19. The nodes are spaced out in proportion to the curvature of the road: close together at sharp corners and farther apart on straight stretches of road. Each node has a unique id.

The main functions for working with nodes are GET_NTH_CLOSEST_VEHICLE_NODE and GET_NTH_CLOSEST_VEHICLE_NODE_ID. A way to call them in a script is shown in the code snippet below. In this code snippet, the "safe" arguments serve an unknown purpose, as do the two zeros in GET_NTH_CLOSEST_VEHICLE_NODE_ID. The i variable specifies which node in the order of proximity should be selected. There is also a function GET_VEHICLE_NODE_PROPERTIES; however, we could not find a way to get this function to work.

OutputArgument safe1 = new OutputArgument();
OutputArgument safe2 = new OutputArgument();
OutputArgument safe3 = new OutputArgument();
Vector3 midNode;
OutputArgument outPosArg = new OutputArgument();

Function.Call(Hash.GET_NTH_CLOSEST_VEHICLE_NODE,
    playerPos.X, playerPos.Y, playerPos.Z, i, outPosArg, safe1, safe2, safe3);
midNode = outPosArg.GetResult<Vector3>();

int nodeId = Function.Call<int>(Hash.GET_NTH_CLOSEST_VEHICLE_NODE_ID,
    playerPos.X, playerPos.Y, playerPos.Z, i, safe1, 0f, 0f);


Figure 19: Red markers represent locations of vehicle nodes.


The benefit of this system is that we can locate our car on the network by getting the closest node. Given the road heading and lane width, it is possible to compute the centers of the lanes, as seen in Figure 20. The problem is that, as far as we could find, there is no way of getting the heading of the road or the number and positions of the lanes around a node.

Figure 20: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width.

A promising approach to solving this problem was road model fitting. We know that the node is at the center of the road. We do not know if it is on a lane marking or in the middle of a lane. We could assume that it is on a lane marking and then count the number of lanes on the left and right. This could be done by moving over one lane width at a time and checking whether the point is still on the road, using IS_POINT_ON_ROAD and GetStreetName; a sketch of such a probe follows below. We can repeat the same method under the assumption that the node is in the middle of a lane. Whichever assumption finds more lanes is the correct one, as the wrong one will not count the outermost lanes. This still leaves the question of finding the heading of the road and whether the node is between lanes going in opposite directions. However, there are two fundamental problems with this approach which make it useless. First, it assumes that the nodes are at the centers of lanes or on lane markings. Upon further exploration, we found that nodes can be on medians; see for example Figure 21. This is still the center of the road, just not where we expect it. Second, IS_POINT_ON_ROAD is not a reliable indicator of whether a point is actually on a road. Sometimes it returns false for points which are clearly on the road, and sometimes it returns true for points which are on the side of the road.
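A minimal sketch of that probe, assuming a right vector for the road and the 5.6 m lane width measured above (the helper name and probe limit are ours):

// Count lanes to one side of a node by stepping a lane width at a time
// and asking the engine if the probe point is still on a road.
int CountLanes(Vector3 node, Vector3 right, float laneWidth)
{
    int lanes = 0;
    for (int i = 1; i <= 8; i++) // arbitrary upper bound on lanes
    {
        Vector3 probe = node + i * laneWidth * right;
        bool onRoad = Function.Call<bool>(Hash.IS_POINT_ON_ROAD,
            probe.X, probe.Y, probe.Z, 0);
        if (!onRoad) break; // as noted above, this check is unreliable
        lanes++;
    }
    return lanes;
}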

Figure 21: Red markers represent locations of vehicle nodes. Blue markers are extrapolations of lane middles based on road heading and lane width. The blue marker in front of the test car represents where we want to measure lanes.


There are two solutions to this problem. The first is to keep hacking at the game until we find all of this information. The information we are looking for must be somewhere in the game, because the game AI knows where to drive: it knows where the lanes are and how to stay in them. The second solution is to build a database of nodes. Figure 22 lists the data which would be stored in this database.

Field | Meaning
nodeId | The numerical id of the node.
onMarking | True if the node is on a lane marking, false if it is in the middle of a lane.
oneWay | True if the traffic on both sides of the node moves in the same direction.
leftStart | Vector representing the point where the road begins left of the node.
leftEnd | Vector representing the point where the road ends left of the node.
rightStart | Vector representing the point where the road begins right of the node.
rightEnd | Vector representing the point where the road ends right of the node.
heading | Vector representing the heading of the road.

Figure 22: Node database entry design.

The problem with this method is that there are over 70,000 nodes and there does not appear to be an easy way of collecting this information. At the present moment, however, there does not appear to be a simpler solution.


4 Towards The Ultimate AI Machine

The previous section outlined methods for getting information out of GTA 5 to create datasets. To fully utilize GTA 5, we still need to create the databases of nodes and spaces of interest. Once that is done, we will move on to creating datasets and training neural networks.

The objective of harvesting this data has been emphasized as training data for neural networks. However, the ultimate goal is much grander: building a system which can master driving in GTA 5. This system would probably include several neural networks and perhaps other statistical models. For example, it may include a network for locating pedestrians, an SVM for classifying street signs, another network for recognizing traffic lights, etc. All of these components would be linked together by some master program that would construct the most likely world model based on all of these "sensors". Then another program would be responsible for driving the car. Since we can extract data from GTA 5 in real time, we can test how well this system works in changing conditions.

In the process of building such a system, it is possible to test out some new ideas in neural networks. We would like to continue to explore curriculum learning [5] and self-paced learning [18] [16] as means of presenting examples in order of difficulty. Since these ideas have been applied to object tracking in video [28], teaching robots motor skills [17], matrix factorization [31], handwriting recognition [22], and multi-task learning [26], surpassing state-of-the-art benchmarks, we hope that they could be used to improve autonomous driving. Another interesting idea is transfer learning [25], the ability to use a network trained in one domain in another domain. This could be applied in pedestrian and sign classifiers. Lastly, we have been working on ways to use optimal learning to select the best neural network architectures. It would be interesting to try those methods in this application.

Building this system presents 2 major difficulties. First, both the game and the neural networks are GPU-intensive processes. Running both on a single machine would require a lot of computational power. Second, GTA 5 only works on Windows PCs, while most deep learning libraries are Linux-based. Porting either application is close to infeasible. Last semester, working with Daniel Stanley and Bill Zhang, we constructed a solution for running GTA 5 with TorcsNet from [8]. The idea was to run the processes on separate machines and have them communicate via a shared folder on a local network; see Figure 23. During the tests, the amount of data transferred was small: a text file of 13 floats and a 280 by 210 png image. This setup is fast enough for the system to run at around 10 Hz.
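A sketch of one exchange step on the Windows side, under the assumption that the share path and file names are as we configured them (they are illustrative, not part of any published interface; requires using System.IO and System.Linq):

// Hypothetical exchange step: write the current frame for the Linux box,
// then read back the 13 indicators TorcsNet produced for the previous frame.
float[] ExchangeOnce()
{
    string shared = @"\\LINUX-BOX\gta-shared";     // assumed share name
    screenshot(Path.Combine(shared, "frame.bmp")); // Appendix A function
    string text = File.ReadAllText(Path.Combine(shared, "indicators.txt"));
    return text.Split(' ').Select(float.Parse).ToArray();
}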


Figure 23: GTA V Experimental Setup

4.1 Future Research Goals

Build a database of GTA V road nodes

Build a database of GTA V road signs

Train sign classifier

Train traffic lights classifier

Compare how well GTA V trained classifier works on real datasets

Check how well the TORCS network can identify cars in GTA V

Build a robust controller in GTA V which uses all 13 indicators

Explore the effects of curriculum learning on driving performance

Explore transfer learning and optimal learning for neural networks

Test trained models in a real vehicle (PAVE)


A Screenshot Function

private struct Rect
{
    public int Left;
    public int Top;
    public int Right;
    public int Bottom;
}

[DllImport("C:\\Windows\\System32\\user32.dll")]
private static extern IntPtr GetForegroundWindow();

[DllImport("C:\\Windows\\System32\\user32.dll")]
private static extern IntPtr GetClientRect(IntPtr hWnd, ref Rect rect);

[DllImport("C:\\Windows\\System32\\user32.dll")]
private static extern IntPtr ClientToScreen(IntPtr hWnd, ref Point point);

void screenshot(String filename)
{
    //UI.Notify("Taking screenshot?");
    // Find the game window and its client area in screen coordinates.
    var foregroundWindowsHandle = GetForegroundWindow();
    var rect = new Rect();
    GetClientRect(foregroundWindowsHandle, ref rect);
    var pTL = new Point();
    var pBR = new Point();
    pTL.X = rect.Left;
    pTL.Y = rect.Top;
    pBR.X = rect.Right;
    pBR.Y = rect.Bottom;
    ClientToScreen(foregroundWindowsHandle, ref pTL);
    ClientToScreen(foregroundWindowsHandle, ref pBR);
    Rectangle bounds = new Rectangle(pTL.X, pTL.Y, rect.Right - rect.Left,
        rect.Bottom - rect.Top);
    // Copy the client area of the screen into a bitmap.
    using (Bitmap bitmap = new Bitmap(bounds.Width, bounds.Height))
    {
        using (Graphics g = Graphics.FromImage(bitmap))
        {
            g.ScaleTransform(.2f, .2f);
            g.CopyFromScreen(new Point(bounds.Left, bounds.Top), Point.Empty, bounds.Size);
        }
        // Resize to the dataset image size and save to disk.
        Bitmap output = new Bitmap(IMAGE_WIDTH, IMAGE_HEIGHT);
        using (Graphics g = Graphics.FromImage(output))
        {
            g.DrawImage(bitmap, 0, 0, IMAGE_WIDTH, IMAGE_HEIGHT);
        }
        output.Save(filename, ImageFormat.Bmp);
    }
}


References

[1] Lane width. http://safety.fhwa.dot.gov/geometric/pubs/mitigationstrategies/chapter3/3_lanewidth.cfm. Accessed: 2016-4-29.

[2] Paths (GTA V). http://gta.wikia.com/wiki/Paths_(GTA_V). Accessed: 2016-4-29.

[3] F. H. Administration. Manual on uniform traffic control devices. 2009.

[4] A. R. Atreya, B. C. Cattle, B. M. Collins, B. Essenburg, G. H. Franken, A. M. Saxe, S. N. Schiffres, and A. L. Kornhauser. Prospect Eleven: Princeton University's entry in the 2005 DARPA Grand Challenge. Journal of Field Robotics, 23(9):745-753, 2006.

[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48. ACM, 2009.

[6] B. Wymann, E. Espié, C. Guionneau, C. Dina, R. Coulom, and A. Sumner. TORCS, the open racing car simulator. http://www.torcs.org, 2014.

[7] S. M. Bileschi. StreetScenes: Towards scene understanding in still images. PhD thesis, Citeseer, 2006.

[8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.

[9] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304-311. IEEE, 2009.

[10] E. Donges. A two-level model of driver steering behavior. Human Factors: The Journal of the Human Factors and Ergonomics Society, 20(6):691-707, 1978.

[11] F. Flohr, D. M. Gavrila, et al. PedCut: an iterative framework for pedestrian segmentation combining shape models and multiple data cues. 2013.

[12] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013.

[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[14] E. Guizzo. How Google's self-driving car works. IEEE Spectrum Online, October 18, 2011.


[15] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.

[16] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078-2086, 2014.

[17] A. Karpathy and M. Van De Panne. Curriculum learning for motor skills. In Advances in Artificial Intelligence, pages 325-330. Springer, 2012.

[18] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1189-1197. Curran Associates, Inc., 2010.

[19] M. Land, J. Horwood, et al. Which parts of the road guide steering? Nature, 377(6547):339-340, 1995.

[20] M. F. Land and D. N. Lee. Where do we look when we steer. Nature, 1994.

[21] F. Larsson and M. Felsberg. Using Fourier descriptors and spatial models for traffic sign recognition. In Image Analysis, pages 238-249. Springer, 2011.

[22] J. Louradour and C. Kermorvant. Curriculum learning for handwritten text line recognition. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 56-60. IEEE, 2014.

[23] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool. Traffic sign recognition - how far are we from the solution? In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1-8. IEEE, 2013.

[24] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. Intelligent Transportation Systems, IEEE Transactions on, 13(4):1484-1497, 2012.

[25] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345-1359, 2010.

[26] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. arXiv preprint arXiv:1412.1353, 2014.

[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.


[28] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2379-2386. IEEE, 2013.

[29] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661-692, 2006.

[30] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier. Evaluation of road marking feature extraction. In Proceedings of the 11th IEEE Conference on Intelligent Transportation Systems (ITSC'08), pages 174-181, Beijing, China, 2008. http://perso.lcpc.fr/tarel.jean-philippe/publis/itsc08.html.

[31] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
