Processing Planetary Sized Datasets
Tim Park @timpark
Vehicle Location Trace Dataset

Vehicle Id         Trip Id   Timestamp      Latitude   Longitude    Altitude
…
10152875766888406  57169639  1445377623000  36.966819  -122.012298  1809
10152875766888406  57169639  1445377625000  36.966845  -122.012248  1809
10152875766888406  57169639  1445377627000  36.966877  -122.012228  1814
10152875766888406  57169639  1445377629000  36.966913  -122.012236  1814
10152875766888406  57169639  1445377630000  36.966946  -122.012236  1814
10152875766888406  57169639  1445377631000  36.966984  -122.012263  1815
10152875766888406  57169639  1445377632000  36.967027  -122.012281  1815
…
Vehicle Monthly Dataset Slice
39 TB of raw location data:
• 584 billion data points
• 116 million trips
Location Storage
• Many trips per vehicle.
• Want to be able to pull a range of locations by timestamp for trip display.
Trip Locations => Data Range Query

Vehicle Id         Timestamp      Latitude   Longitude
…
10152875766888406  1445374423000  36.966819  -122.012298
10152875766888406  1445377625000  36.966845  -122.012248
10152875766888406  1445377627000  36.966877  -122.012228
10152875766888406  1445377629000  36.966913  -122.012236
10152875766888406  1445377630000  36.966946  -122.012236
10152875766888406  1445377631000  36.966984  -122.012263
10152875766888406  1445379512000  36.967027  -122.012281
…
Location Storage Options
This is a challenge with a large dataset:
• A traditional relational database (e.g. Postgres) typically requires hand sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g. MongoDB) can be very expensive.
• Lightly indexed solutions (HBase, Cassandra, and Azure Table Storage) are a good fit because we really only have one query we need to execute against the data.
Pattern 1: Use lightly structured storage
PartitionKey (vehicleId)  RowKey (timestamp)  Latitude   Longitude
10152875766888406         1445377623000       36.966819  -122.012298
10152875766888406         1445377625000       36.966845  -122.012248
10152875766888406         1445377627000       36.966877  -122.012228
10152875766888406         1445377629000       36.966913  -122.012236
10152875766888406         1445377630000       36.966946  -122.012236
…
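The PartitionKey/RowKey layout above supports the one query the trip display needs: all rows for one vehicle whose timestamp falls in a range. The following is a minimal in-memory sketch of that access pattern. `LocationTable`, `insert`, and `query_range` are hypothetical names used for illustration; this is not the Azure Table Storage API.

```python
import bisect
from collections import defaultdict


class LocationTable:
    """In-memory stand-in for a partitioned table sorted by row key (illustrative only)."""

    def __init__(self):
        # partition key (vehicle id) -> sorted list of (timestamp, latitude, longitude)
        self._partitions = defaultdict(list)

    def insert(self, vehicle_id, timestamp, latitude, longitude):
        # Keep each partition ordered by row key (timestamp) on insert.
        bisect.insort(self._partitions[vehicle_id], (timestamp, latitude, longitude))

    def query_range(self, vehicle_id, start_ts, end_ts):
        """Return rows with start_ts <= timestamp <= end_ts for one vehicle."""
        rows = self._partitions.get(vehicle_id, [])
        lo = bisect.bisect_left(rows, (start_ts,))
        hi = bisect.bisect_right(rows, (end_ts, float("inf"), float("inf")))
        return rows[lo:hi]


# Rows taken from the table above.
table = LocationTable()
vehicle = 10152875766888406
table.insert(vehicle, 1445377623000, 36.966819, -122.012298)
table.insert(vehicle, 1445377625000, 36.966845, -122.012248)
table.insert(vehicle, 1445377627000, 36.966877, -122.012228)
table.insert(vehicle, 1445377629000, 36.966913, -122.012236)
table.insert(vehicle, 1445377630000, 36.966946, -122.012236)

trip_rows = table.query_range(vehicle, 1445377625000, 1445377629000)
```

Because only one partition is touched and rows within it are already ordered, the lookup is two binary searches plus a slice, which is the property that makes a lightly indexed store sufficient here.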
Trip Storage
• Want to query a set of trips in a bounding box.
• Also want to filter activities based on distance and duration.
Trip Data Schema

Trip Id  start (sec)  finish (sec)  distance (m)  duration (m)  bbox (geometry)
101528   1445377625   1445383025    50023         6222          [-104.990, 39.7392...
101643   1445362577   1445373616    28778         2498          [-122.01228, 36.96…
101843   1445377627   1445382432    4629          701           [0.1278, 51.5074 …
101901   1445362577   1445374713    99691         14232         [139.6917, 35.699...
102102   1445374713   1445374713    25259         6657          [1.3521, 103.8129…
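The trip query combines a bounding-box intersection with filters on distance and duration. In practice a spatial index would answer this; the pure-Python sketch below only illustrates the filter logic. `Trip`, `bbox_intersects`, and `find_trips` are hypothetical names, the bounding-box order (min_lon, min_lat, max_lon, max_lat) is an assumption, and the bounding-box coordinates in the sample data are illustrative because the slide truncates the real values.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed order: (min_lon, min_lat, max_lon, max_lat)
BBox = Tuple[float, float, float, float]


@dataclass
class Trip:
    trip_id: int
    distance: float  # same unit as the distance column above
    duration: float  # same unit as the duration column above
    bbox: BBox


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_trips(trips: List[Trip], query_bbox: BBox,
               min_distance: float = 0.0, min_duration: float = 0.0) -> List[Trip]:
    """Trips that intersect query_bbox and meet the distance and duration minimums."""
    return [
        trip for trip in trips
        if bbox_intersects(trip.bbox, query_bbox)
        and trip.distance >= min_distance
        and trip.duration >= min_duration
    ]


# Ids, distances, and durations come from the table above; bboxes are illustrative.
trips = [
    Trip(101528, 50023, 6222, (-105.0, 39.7, -104.9, 39.8)),
    Trip(101643, 28778, 2498, (-122.02, 36.96, -122.00, 36.98)),
]
matches = find_trips(trips, (-123.0, 36.0, -121.0, 38.0), min_distance=10000)
```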
Pattern 2: Use “polyglot persistence”
Location Data (Azure Table Storage)

user Id            timestamp   latitude   longitude
10152875766888406  1445377623  36.966819  -122.012298
10152875766888406  1445377625  36.966845  -122.012248
…
10152875766888406  1445383025  36.966913  -122.012236
10152875766888406  1445383030  36.966946  -122.012236

Trip Data (Postgres + PostGIS)

activity id  start       finish      …  bbox
101528       1445362577  1445373616  …  [-104.990, 39.7392...
101643       1445377625  1445383025  …  [-122.01228, 36.96…
101843       1445377627  1445382432  …  [0.1278, 51.5074 …
101901       1445362577  1445374713  …  [139.6917, 35.699...
102102       1445374713  1445374713  …  [1.3521, 103.8129…
Usage Heatmap
Heatmap Generation
• Total number of location samples in a geographical area.
• Whole dataset operation.
Pattern 3: XYZ Tiles for summarization
• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension.
[Figure: tile grid with example tile ids 2_0_0, 1_0_1, and 3_3_2]
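A small sketch can make the tile scheme concrete. It assumes the standard Web Mercator (slippy map) tile formula and a `zoom_row_column` id layout; that layout is inferred from the mapper example later in the deck, where (36.9741, -122.0308) maps to 10_398_164. `tile_id_for_location` and `child_tile_ids` are hypothetical helpers and are not taken from the geotile library.

```python
import math


def tile_id_for_location(latitude, longitude, zoom):
    """Return a 'zoom_row_column' tile id for a WGS84 point (assumed Web Mercator tiling)."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    column = int((longitude + 180.0) / 360.0 * n)
    lat_rad = math.radians(latitude)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge or beyond the Mercator latitude limit stay in range.
    column = min(max(column, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return f"{zoom}_{row}_{column}"


def child_tile_ids(tile_id):
    """Return the four children of a tile at the next higher zoom level."""
    zoom, row, column = (int(part) for part in tile_id.split("_"))
    return [
        f"{zoom + 1}_{2 * row + d_row}_{2 * column + d_col}"
        for d_row in (0, 1)
        for d_col in (0, 1)
    ]


tile = tile_id_for_location(36.9741, -122.0308, 10)
children = child_tile_ids(tile)
```

Encoding the position as a single string key is what lets a tile serve directly as a grouping key in a key-value aggregation.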
Apache Spark
• Can think of it as “Hadoop the Next Generation”
• Better performance (10-100x)
• Cleaner programming model
• Used HDInsight Spark (Azure) to avoid operational difficulties of running our own Spark cluster.
Heatmap Spark Mapper
For each location, map to tiles at every zoom level:
(36.9741, -122.0308) => [(10_398_164, 1), (11_797_329, 1), (12_1594_659, 1), (13_3189_1319, 1), (14_6378_2638, 1), (15_12757_5276, 1), (16_25514_10552, 1), (17_51028_21105, 1), (18_102057_42211, 1)]
Heatmap Spark Algorithm
Reduce all these mappings with the same key into an aggregate value:
[(10_398_164, 1), (10_398_164, 1), … (10_398_164, 1)] => (10_398_164, 151)
Heatmap Spark Mapper

    def tile_id_mapper(location):
        # Emit one (tileId, 1) pair per zoom level for this location.
        tileMappings = []
        tileIds = Tile.tile_ids_for_zoom_levels(
            location['latitude'],
            location['longitude'],
            MIN_ZOOM_LEVEL,
            MAX_ZOOM_LEVEL
        )
        for tileId in tileIds:
            tileMappings.append((tileId, 1))
        return tileMappings
Heatmap Spark
Building the heatmap then boils down to this in Spark:

    lines = sc.textFile('wasb://[email protected]/')
    locations = lines.flatMap(json_loader)
    # Parentheses allow the chained calls to span several lines.
    heatmap = (locations
               .flatMap(tile_id_mapper)
               .reduceByKey(lambda agg1, agg2: agg1 + agg2))
    heatmap.saveAsTextFile('wasb://[email protected]/')
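The flatMap and reduceByKey steps can be mimicked locally without a cluster, which may help when reasoning about what the job produces. The sketch below is plain Python and not Spark; `flat_map`, `reduce_by_key`, `sample_mapper`, and the two sample locations "a" and "b" are hypothetical, and the tile ids are looked up rather than computed.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Combine the values of all pairs sharing a key using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical stand-in for tile_id_mapper: two nearby locations that share
# a zoom-10 tile but fall into different zoom-11 tiles.
SAMPLE_TILE_IDS = {
    "a": ["10_398_164", "11_797_329"],
    "b": ["10_398_164", "11_797_330"],
}


def sample_mapper(location_key):
    return [(tile_id, 1) for tile_id in SAMPLE_TILE_IDS[location_key]]


mappings = flat_map(sample_mapper, ["a", "b"])
heatmap = reduce_by_key(lambda agg1, agg2: agg1 + agg2, mappings)
```

The reducer has to be associative and commutative, as addition is here, because Spark combines values per partition before and after the shuffle in no guaranteed order.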
Spark Shuffle
Pattern 4: Incremental Ingestion
[Diagram labels: Trips; Application API; Azure Event Hub (Kafka); Azure Stream Analytics; Azure Table Storage (HBase); hourly data slices from 2016-04-28 12:00 to 2016-04-28 17:00]
Pattern 5: Data Slice Processing
[Diagram labels: hourly data slices from 2016-04-28 13:00 to 2016-04-28 17:00; 2016-04-28 17:00 Heatmap Partial; Existing Heatmap; New Heatmap]
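The labels on this slide suggest that each new slice yields a partial heatmap, which is then combined with the existing heatmap to produce the new one. A minimal sketch of that merge follows, assuming heatmaps are plain mappings from tile id to sample count; `merge_heatmaps` is a hypothetical helper and all counts other than 151 are illustrative.

```python
from collections import Counter


def merge_heatmaps(existing, partial):
    """Add the per-tile counts of a partial heatmap to an existing heatmap."""
    merged = Counter(existing)
    merged.update(partial)  # Counter.update adds counts instead of replacing them
    return dict(merged)


existing_heatmap = {"10_398_164": 151, "11_797_329": 80}
partial_heatmap = {"10_398_164": 3, "11_797_330": 1}
new_heatmap = merge_heatmaps(existing_heatmap, partial_heatmap)
```

Because counts are additive, only the newest slice has to be reprocessed each hour rather than the whole dataset.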
Lambda Architecture: Speed Layer
[Diagram labels: Geolocated Tweets; Intersection Service (Azure Functions); Features; Summary Updates, e.g. Bin Jawad +1, Benghazi +1, Libya +1 and Sirte +1, Libya +1]
Displaying Heatmaps
Pattern 6: Precomputing Heatmaps
Pattern 6: Precomputing Data Views
[Diagram labels: 2016-04-28 17:00 Heatmap Deltas; Previous Heatmap; New Heatmap; Updates; Application API; tiles 6_25_31; 7_50_62, 7_50_63, 7_50_64; 8_100_124, 8_100_125, 8_100_126; 9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250]
Pattern 8: Use binary encoding
JSON: 39 TB
Avro: 23 TB
About 41% smaller (Avro is about 59% of the JSON size)
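The saving comes from dropping repeated field names and text-encoded numbers. The sketch below illustrates the effect on a single location record; it is not Avro. It uses Python's standard `struct` module as a stand-in for a schema-based binary encoding, and the field names and the `<qqqddi` layout are assumptions made for illustration.

```python
import json
import struct

# Assumed fixed layout: three 64-bit ints, two doubles, one 32-bit int (little-endian).
RECORD_FORMAT = "<qqqddi"
FIELDS = ("vehicleId", "tripId", "timestamp", "latitude", "longitude", "altitude")


def encode_json(record):
    """Text encoding: field names are repeated in every record."""
    return json.dumps(record).encode("utf-8")


def encode_binary(record):
    """Binary encoding: the schema lives in RECORD_FORMAT, not in the payload."""
    return struct.pack(RECORD_FORMAT, *(record[name] for name in FIELDS))


def decode_binary(payload):
    return dict(zip(FIELDS, struct.unpack(RECORD_FORMAT, payload)))


record = {
    "vehicleId": 10152875766888406,
    "tripId": 57169639,
    "timestamp": 1445377623000,
    "latitude": 36.966819,
    "longitude": -122.012298,
    "altitude": 1809,
}
json_size = len(encode_json(record))
binary_size = len(encode_binary(record))
```

The per-record ratio in this toy layout will differ from what Avro achieves on the real dataset, since Avro uses variable-length integers and adds container-file overhead.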
Open Source
• geotile: http://github.com/timfpark/geotile
  XYZ tile math in C#, JavaScript, and Python
• heatmap: http://github.com/timfpark/heatmap
  Spark code for building heatmaps
• tileIndex: http://github.com/timfpark/tileIndexPusher
  Azure Function for pushing tile indexes
Questions?
Tim Park @timpark