Processing Planetary Sized Datasets

Tim Park @timpark

Vehicle Location Trace Dataset

Vehicle Id         Trip Id   Timestamp      Latitude   Longitude    Altitude
…
10152875766888406  57169639  1445377623000  36.966819  -122.012298  1809
10152875766888406  57169639  1445377625000  36.966845  -122.012248  1809
10152875766888406  57169639  1445377627000  36.966877  -122.012228  1814
10152875766888406  57169639  1445377629000  36.966913  -122.012236  1814
10152875766888406  57169639  1445377630000  36.966946  -122.012236  1814
10152875766888406  57169639  1445377631000  36.966984  -122.012263  1815
10152875766888406  57169639  1445377632000  36.967027  -122.012281  1815
…

Vehicle Monthly Dataset Slice

39 TB of raw location data:
• 584 billion data points
• 116 million trips

Location Storage

• Many trips per vehicle.
• Want to be able to pull a range of locations by timestamp for trip display.

Trip Locations => Data Range Query

Vehicle Id         Timestamp      Latitude   Longitude
…
10152875766888406  1445374423000  36.966819  -122.012298
10152875766888406  1445377625000  36.966845  -122.012248
10152875766888406  1445377627000  36.966877  -122.012228
10152875766888406  1445377629000  36.966913  -122.012236
10152875766888406  1445377630000  36.966946  -122.012236
10152875766888406  1445377631000  36.966984  -122.012263
10152875766888406  1445379512000  36.967027  -122.012281
…

Location Storage Options

This is a challenge with a large dataset:
• A traditional relational database (e.g. Postgres) typically requires manual sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g. MongoDB) can be very expensive.
• Lightly indexed solutions (HBase, Cassandra, and Azure Table Storage) are a good fit because we really only have one query we need to execute against the data.

Pattern 1: Use lightly structured storage

PartitionKey (vehicleId)  RowKey (timestamp)  Latitude   Longitude
10152875766888406         1445377623000       36.966819  -122.012298
10152875766888406         1445377625000       36.966845  -122.012248
10152875766888406         1445377627000       36.966877  -122.012228
10152875766888406         1445377629000       36.966913  -122.012236
10152875766888406         1445377630000       36.966946  -122.012236
…
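The partition key / row key layout above can be sketched with a small in-memory model. This is not the Azure Table Storage API; `LocationTable`, `insert`, and `range_query` are hypothetical names used only to show why one sorted key per vehicle is enough to serve the single range query the deck describes.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class LocationTable:
    """Toy in-memory stand-in for a partition-key / row-key store."""

    def __init__(self):
        # partition key (vehicle id) -> rows sorted by (timestamp_ms, lat, lon)
        self._partitions = defaultdict(list)

    def insert(self, vehicle_id, timestamp_ms, latitude, longitude):
        rows = self._partitions[vehicle_id]
        row = (timestamp_ms, latitude, longitude)
        # Keep each partition sorted by row key so range scans stay cheap.
        rows.insert(bisect_left(rows, row), row)

    def range_query(self, vehicle_id, start_ms, end_ms):
        """Return rows with start_ms <= timestamp <= end_ms for one vehicle."""
        rows = self._partitions.get(vehicle_id, [])
        lo = bisect_left(rows, (start_ms,))
        hi = bisect_right(rows, (end_ms, float("inf"), float("inf")))
        return rows[lo:hi]


table = LocationTable()
vehicle = 10152875766888406
for ts, lat, lon in [
    (1445377623000, 36.966819, -122.012298),
    (1445377625000, 36.966845, -122.012248),
    (1445377627000, 36.966877, -122.012228),
    (1445377629000, 36.966913, -122.012236),
    (1445377630000, 36.966946, -122.012236),
]:
    table.insert(vehicle, ts, lat, lon)

trip_rows = table.range_query(vehicle, 1445377625000, 1445377629000)
print(len(trip_rows))  # 3
```

Only the row key is ordered and no secondary index is kept, which is the trade-off the "lightly indexed" options accept.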

Trip Storage

• Want to query a set of trips in a bounding box.

• Also want to filter activities based on distance and duration.

Trip Data Schema

Trip Id  start (sec)  finish (sec)  distance (m)  duration (m)  bbox (geometry)
101528   1445377625   1445383025    50023         6222          [-104.990, 39.7392...
101643   1445362577   1445373616    28778         2498          [-122.01228, 36.96…
101843   1445377627   1445382432    4629          701           [0.1278, 51.5074 …
101901   1445362577   1445374713    99691         14232         [139.6917, 35.699...
102102   1445374713   1445374713    25259         6657          [1.3521, 103.8129…
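The query this schema has to serve (trips whose bounding box overlaps a query box, filtered by distance and duration) can be sketched in plain Python. This is an in-memory illustration, not a spatial database query. The `(min_lon, min_lat, max_lon, max_lat)` layout is an assumption, and the bounding boxes below are hypothetical values, since the slide truncates the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trip:
    trip_id: int
    distance: float  # same unit as the "distance" column
    duration: float  # same unit as the "duration" column
    bbox: tuple      # assumed layout: (min_lon, min_lat, max_lon, max_lat)


def bboxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_trips(trips, query_bbox, min_distance=0, min_duration=0):
    """Trips overlapping query_bbox that meet the distance/duration floors."""
    return [
        t for t in trips
        if bboxes_intersect(t.bbox, query_bbox)
        and t.distance >= min_distance
        and t.duration >= min_duration
    ]


# Trip ids, distances, and durations come from the table above;
# the bounding boxes are hypothetical stand-ins.
trips = [
    Trip(101643, 28778, 2498, (-122.02, 36.96, -122.00, 36.98)),
    Trip(101843, 4629, 701, (0.12, 51.50, 0.14, 51.52)),
]

query = (-122.1, 36.9, -121.9, 37.1)
print([t.trip_id for t in find_trips(trips, query)])  # [101643]
```

The sketch scans every trip and ignores boxes that cross the antimeridian. A spatial index avoids the full scan, which is the reason to keep trip data in a store with geometry support.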

Pattern 2: Use “polyglot persistence”

Location Data (Azure Table Storage)

user Id            timestamp   latitude   longitude
10152875766888406  1445377623  36.966819  -122.012298
10152875766888406  1445377625  36.966845  -122.012248
…
10152875766888406  1445383025  36.966913  -122.012236
10152875766888406  1445383030  36.966946  -122.012236

Trip Data (Postgres + PostGIS)

activity id  start       finish      …  bbox
101528       1445362577  1445373616  …  [-104.990, 39.7392...
101643       1445377625  1445383025  …  [-122.01228, 36.96…
101843       1445377627  1445382432  …  [0.1278, 51.5074 …
101901       1445362577  1445374713  …  [139.6917, 35.699...
102102       1445374713  1445374713  …  [1.3521, 103.8129…

Usage Heatmap

Heatmap Generation

• Total number of location samples in a geographical area.
• Whole dataset operation.

Pattern 3: XYZ Tiles for summarization

• Divides world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps two-dimensional space to one dimension.

Example tile ids: 2_0_0, 1_0_1, 3_3_2
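A minimal sketch of how a latitude/longitude pair maps to a tile id, using the standard Web Mercator XYZ tile formulas. The `zoom_row_column` id layout is inferred from the tile ids shown on the slides. The function names here are illustrative and are not taken from the geotile library.

```python
import math


def tile_id(latitude, longitude, zoom):
    """Return 'zoom_row_column' for the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(latitude)
    column = int((longitude + 180.0) / 360.0 * n)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge (or beyond the Mercator latitude
    # limit) still land on a valid tile.
    column = min(max(column, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return f"{zoom}_{row}_{column}"


def tile_ids_for_zoom_levels(latitude, longitude, min_zoom, max_zoom):
    """Tile ids for one point at every zoom level in [min_zoom, max_zoom]."""
    return [tile_id(latitude, longitude, z) for z in range(min_zoom, max_zoom + 1)]


print(tile_ids_for_zoom_levels(36.9741, -122.0308, 10, 11))
# ['10_398_164', '11_797_329']
```

Because the id is a single string, it can serve as a key for counting, which is what makes the tile scheme useful for summarization.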

Apache Spark

• Can think of it as “Hadoop the Next Generation”
• Better performance (10-100x)
• Cleaner programming model
• Used HDInsight Spark (Azure) to avoid operational difficulties of running our own Spark cluster.

Heatmap Spark Mapper

For each location, map to tiles at every zoom level:

(36.9741, -122.0308) =>
[(10_398_164, 1), (11_797_329, 1), (12_1594_659, 1),
 (13_3189_1319, 1), (14_6378_2638, 1), (15_12757_5276, 1),
 (16_25514_10552, 1), (17_51028_21105, 1), (18_102057_42211, 1)]

Heatmap Spark Algorithm

Reduce all these mappings with the same key into an aggregate value:

[(10_398_164, 1), (10_398_164, 1), …, (10_398_164, 1)] => (10_398_164, 151)
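The reduce step can be tried without a cluster. The sketch below is a plain-Python model of what Spark's `reduceByKey` computes, not Spark itself. `reduce_by_key` is a hypothetical helper, and the count of 3 for the second key is made up for illustration.

```python
def reduce_by_key(pairs, combine):
    """Fold all values that share a key into one value per key."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = combine(totals[key], value)
        else:
            totals[key] = value
    return totals


# 151 samples in one tile, as in the slide's example, plus a second key.
pairs = [("10_398_164", 1)] * 151 + [("11_797_329", 1)] * 3
heatmap = reduce_by_key(pairs, lambda agg1, agg2: agg1 + agg2)
print(heatmap["10_398_164"])  # 151
```

Addition is associative and commutative, so Spark can combine values within each partition first and only shuffle the partial sums.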

Heatmap Spark Mapper

    def tile_id_mapper(location):
        # Emit one (tileId, 1) pair per zoom level for this location.
        tileMappings = []
        tileIds = Tile.tile_ids_for_zoom_levels(
            location['latitude'],
            location['longitude'],
            MIN_ZOOM_LEVEL,
            MAX_ZOOM_LEVEL
        )
        for tileId in tileIds:
            tileMappings.append((tileId, 1))
        return tileMappings

Heatmap Spark

Building the heatmap then boils down to this in Spark:

    lines = sc.textFile('wasb://[email protected]/')
    locations = lines.flatMap(json_loader)
    heatmap = (locations
               .flatMap(tile_id_mapper)
               .reduceByKey(lambda agg1, agg2: agg1 + agg2))
    heatmap.saveAsTextFile('wasb://[email protected]/')
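The `json_loader` passed to `flatMap` is not shown on the slides. A minimal sketch follows, assuming one JSON object per input line with `latitude` and `longitude` fields (the field names the mapper reads). Returning a list lets `flatMap` drop malformed lines instead of failing the job; that behavior is an assumption, not something the deck confirms.

```python
import json


def json_loader(line):
    """Parse one line of text into zero or one location records."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(record, dict):
        return []
    if "latitude" not in record or "longitude" not in record:
        return []
    return [record]


print(json_loader('{"latitude": 36.9741, "longitude": -122.0308}'))
print(json_loader("not json"))  # []
```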

Spark Shuffle

Pattern 4: Incremental Ingestion

[Diagram labels: Trip events; Application API; Azure Event Hub (Kafka); Azure Stream Analytics; Azure Table Storage (HBase); hourly slices from 2016-04-28 12:00 to 2016-04-28 17:00]
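One way to picture the hourly slices is as a bucketing function over event timestamps. The sketch below assumes millisecond timestamps, UTC, and slices labelled by the start of the hour; none of these details are stated on the slide. `hourly_slice` and `group_by_slice` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_slice(timestamp_ms):
    """Return the 'YYYY-MM-DD HH:00' slice label (UTC) for a ms timestamp."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:00")


def group_by_slice(timestamps_ms):
    """Group timestamps into their hourly slices."""
    slices = defaultdict(list)
    for ts in timestamps_ms:
        slices[hourly_slice(ts)].append(ts)
    return dict(slices)


sample = int(datetime(2016, 4, 28, 17, 25, tzinfo=timezone.utc).timestamp() * 1000)
print(hourly_slice(sample))  # 2016-04-28 17:00
```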

Pattern 5: Data Slice Processing

[Diagram labels: hourly slices from 2016-04-28 13:00 to 2016-04-28 17:00; Heatmap Partial for the 2016-04-28 17:00 slice; Existing Heatmap; New Heatmap]
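One reading of this pattern is that only the newest slice is processed, and its partial heatmap is added to the existing totals. A minimal sketch under that assumption, with heatmaps modelled as tile id to count mappings; `merge_heatmaps` and the counts are illustrative.

```python
from collections import Counter


def merge_heatmaps(existing, partial):
    """Add a slice's partial tile counts to the existing heatmap totals."""
    merged = Counter(existing)
    merged.update(partial)  # Counter.update adds counts rather than replacing
    return merged


existing_heatmap = {"10_398_164": 151, "11_797_329": 40}
partial_heatmap = {"10_398_164": 9, "11_797_330": 2}  # from the newest slice only

new_heatmap = merge_heatmaps(existing_heatmap, partial_heatmap)
print(new_heatmap["10_398_164"])  # 160
```

Because the counts are sums, merging a partial result gives the same totals as recomputing over the whole dataset, at the cost of processing one slice.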

Lambda Architecture: Speed Layer

[Diagram labels: Geolocated Tweets; Intersection Service; Azure Functions; Features; Summary Updates. Example updates: Bin Jawad +1, Benghazi +1, Libya +1; Sirte +1, Libya +1]

Displaying Heatmaps

Pattern 6: Precomputing Heatmaps

Pattern 6: Precomputing Data Views

[Diagram labels: 2016-04-28 17:00 Heatmap Deltas; Previous Heatmap; New Heatmap; Updates; Application API]

Tiles shown:
6_25_31
7_50_62, 7_50_63, 7_50_64
8_100_124, 8_100_125, 8_100_126
9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250
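Since each tile has four children at the next zoom level, a tile's parent can be found by halving its row and column. This matches the ids on the slide (9_201_249 sits under 8_100_124, then 7_50_62, then 6_25_31). The sketch below assumes the `zoom_row_column` id layout. `parent_tile_id` and `ancestor_tile_ids` are hypothetical helpers, and using them to decide which precomputed views to refresh is one possible application rather than the deck's stated method.

```python
def parent_tile_id(tile_id):
    """Return the id of the tile one zoom level up that contains tile_id."""
    zoom, row, column = (int(part) for part in tile_id.split("_"))
    if zoom == 0:
        raise ValueError("zoom level 0 has no parent")
    return f"{zoom - 1}_{row // 2}_{column // 2}"


def ancestor_tile_ids(tile_id, min_zoom):
    """Ids of all containing tiles from one level up down to min_zoom."""
    ancestors = []
    current = tile_id
    while int(current.split("_")[0]) > min_zoom:
        current = parent_tile_id(current)
        ancestors.append(current)
    return ancestors


print(ancestor_tile_ids("9_201_249", 6))
# ['8_100_124', '7_50_62', '6_25_31']
```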

Pattern 8: Use binary encoding

JSON: 39 TB
Avro: 23 TB (about 60% of the JSON size, roughly 40% smaller)
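The saving comes from not repeating field names and not spelling numbers out as text. The sketch below uses Python's standard `struct` module as a stand-in for a schema-based binary format. It is not Avro, and the field layout in `RECORD_FORMAT` is an assumption chosen for illustration, so the ratio it produces says nothing about the 39 TB and 23 TB figures.

```python
import json
import struct

# Assumed fixed layout: vehicle id (uint64), trip id (uint32),
# timestamp in ms (int64), latitude (double), longitude (double),
# altitude (int32). Little-endian, no padding.
RECORD_FORMAT = "<QIqddi"

record = {
    "vehicleId": 10152875766888406,
    "tripId": 57169639,
    "timestamp": 1445377623000,
    "latitude": 36.966819,
    "longitude": -122.012298,
    "altitude": 1809,
}

as_json = json.dumps(record).encode("utf-8")
as_binary = struct.pack(RECORD_FORMAT, *record.values())

print(len(as_binary))  # 40
print(len(as_binary) < len(as_json))  # True

# The schema (RECORD_FORMAT plus field order) is needed to read it back.
decoded = dict(zip(record.keys(), struct.unpack(RECORD_FORMAT, as_binary)))
```

The reader must know the schema, which is the trade-off for dropping the self-describing field names.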

Open Source

• geotile: http://github.com/timfpark/geotile
  • XYZ tile math in C#, JavaScript, and Python
• heatmap: http://github.com/timfpark/heatmap
  • Spark code for building heatmaps
• tileIndex: http://github.com/timfpark/tileIndexPusher
  • Azure Function for pushing tile indexes.


Questions?

Tim Park @timpark
