Processing Planetary Sized Datasets
Tim Park @timpark

Page 1: Processing Planetary Sized Datasets

Processing Planetary Sized Datasets
Tim Park @timpark

Page 16: Processing Planetary Sized Datasets

Vehicle Location Trace Dataset

Vehicle Id          Trip Id    Timestamp      Latitude   Longitude    Altitude
…
10152875766888406   57169639   1445377623000  36.966819  -122.012298  1809
10152875766888406   57169639   1445377625000  36.966845  -122.012248  1809
10152875766888406   57169639   1445377627000  36.966877  -122.012228  1814
10152875766888406   57169639   1445377629000  36.966913  -122.012236  1814
10152875766888406   57169639   1445377630000  36.966946  -122.012236  1814
10152875766888406   57169639   1445377631000  36.966984  -122.012263  1815
10152875766888406   57169639   1445377632000  36.967027  -122.012281  1815

Page 18: Processing Planetary Sized Datasets

Vehicle Monthly Dataset Slice

39 TB of raw location data:
• 584 billion data points
• 116 million trips

Page 19: Processing Planetary Sized Datasets

Location Storage

• Many trips per vehicle.
• Want to be able to pull a range of locations by timestamp for trip display.

Page 20: Processing Planetary Sized Datasets

Trip Locations => Data Range Query

Vehicle Id          Timestamp      Latitude   Longitude
…
10152875766888406   1445374423000  36.966819  -122.012298
10152875766888406   1445377625000  36.966845  -122.012248
10152875766888406   1445377627000  36.966877  -122.012228
10152875766888406   1445377629000  36.966913  -122.012236
10152875766888406   1445377630000  36.966946  -122.012236
10152875766888406   1445377631000  36.966984  -122.012263
10152875766888406   1445379512000  36.967027  -122.012281

Page 21: Processing Planetary Sized Datasets

Location Storage Options

This is a challenge with a large dataset:
• A traditional relational database (e.g. Postgres) typically requires hand sharding to scale to PBs of data.
• Highly indexed non-relational solutions (e.g. MongoDB) can be very expensive.
• Lightly indexed solutions (HBase, Cassandra, Azure Table Storage) are a good fit because we really only have one query we need to execute against the data.

Page 22: Processing Planetary Sized Datasets

Pattern 1: Use lightly structured storage

PartitionKey (vehicleId)   RowKey (timestamp)   Latitude    Longitude
10152875766888406          1445377623000        36.966819   -122.012298
10152875766888406          1445377625000        36.966845   -122.012248
10152875766888406          1445377627000        36.966877   -122.012228
10152875766888406          1445377629000        36.966913   -122.012236
10152875766888406          1445377630000        36.966946   -122.012236
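
A minimal sketch of the single range query this layout supports, using the current azure-data-tables Python SDK as an illustration; the table name, connection string, and helper are assumptions, not the talk's actual code.

from azure.data.tables import TableClient

# Hypothetical table name and connection string, for illustration only.
table = TableClient.from_connection_string(
    conn_str='<storage-connection-string>',
    table_name='locations'
)

def locations_for_trip(vehicle_id, start_ms, finish_ms):
    # PartitionKey = vehicleId keeps all of a vehicle's points together;
    # RowKey = timestamp keeps them sorted, so pulling a trip is one range scan.
    # RowKeys compare as strings, so fixed-width millisecond timestamps are assumed.
    query = (
        "PartitionKey eq '{0}' and RowKey ge '{1}' and RowKey le '{2}'"
        .format(vehicle_id, start_ms, finish_ms)
    )
    return list(table.query_entities(query))

points = locations_for_trip('10152875766888406', '1445377623000', '1445377632000')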

Page 23: Processing Planetary Sized Datasets

Trip Storage

• Want to query a set of trips in a bounding box.

• Also want to filter activities based on distance and duration.

Page 24: Processing Planetary Sized Datasets

Trip Data Schema

Trip Id   start (sec)   finish (sec)   distance (m)   duration (m)   bbox (geometry)
101528    1445377625    1445383025     50023          6222           [-104.990, 39.7392...
101643    1445362577    1445373616     28778          2498           [-122.01228, 36.96…
101843    1445377627    1445382432     4629           701            [0.1278, 51.5074…
101901    1445362577    1445374713     99691          14232          [139.6917, 35.699...
102102    1445374713    1445374713     25259          6657           [1.3521, 103.8129…

Page 25: Processing Planetary Sized Datasets

Pattern 2: Use “polyglot persistence”

Location Data (Azure Table Storage)

user Id             timestamp    latitude   longitude
10152875766888406   1445377623   36.966819  -122.012298
10152875766888406   1445377625   36.966845  -122.012248
10152875766888406   1445383025   36.966913  -122.012236
10152875766888406   1445383030   36.966946  -122.012236

Trip Data (Postgres + PostGIS)

activity id   start        finish       …   bbox
101528        1445362577   1445373616   …   [-104.990, 39.7392...
101643        1445377625   1445383025   …   [-122.01228, 36.96…
101843        1445377627   1445382432   …   [0.1278, 51.5074…
101901        1445362577   1445374713   …   [139.6917, 35.699...
102102        1445374713   1445374713   …   [1.3521, 103.8129…
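
A minimal sketch of the bounding-box query the Postgres + PostGIS side serves, assuming a hypothetical trips table with a geometry column named bbox and the psycopg2 driver; the column names are illustrative, not the talk's actual schema.

import psycopg2

conn = psycopg2.connect('dbname=trips')  # hypothetical connection

def trips_in_bbox(min_lon, min_lat, max_lon, max_lat, min_distance_m, min_duration):
    # ST_MakeEnvelope builds the query rectangle; && is PostGIS's index-backed
    # bounding-box intersection, so only candidate trips are examined.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT trip_id, start_sec, finish_sec, distance_m, duration
            FROM trips
            WHERE bbox && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
              AND distance_m >= %s
              AND duration >= %s
            """,
            (min_lon, min_lat, max_lon, max_lat, min_distance_m, min_duration),
        )
        return cur.fetchall()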

Page 26: Processing Planetary Sized Datasets

Usage Heatmap

Page 28: Processing Planetary Sized Datasets

Heatmap Generation

• Total number of location samples in a geographical area.
• Whole dataset operation.

Page 29: Processing Planetary Sized Datasets

Pattern 3: XYZ Tiles for summarization

• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension (see the tile-math sketch below).

Example tile ids: 2_0_0, 1_0_1, 3_3_2
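
A minimal sketch of the tile math (the standard Web Mercator slippy-map formulas); the geotile library linked at the end of the deck provides the real implementation, so treat this as an illustration only.

import math

def tile_id(latitude, longitude, zoom):
    # Project the coordinate into Web Mercator tile space at this zoom level.
    lat_rad = math.radians(latitude)
    n = 2 ** zoom
    x = int((longitude + 180.0) / 360.0 * n)                             # column
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)   # row
    # Key format used on the slides: zoom_row_column.
    return '{0}_{1}_{2}'.format(zoom, y, x)

tile_id(36.9741, -122.0308, 10)   # => '10_398_164'

Each increment in zoom doubles n in both axes, which is why every tile has exactly four children at the next zoom level.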

Page 30: Processing Planetary Sized Datasets

Apache Spark

• Can think of it as "Hadoop the Next Generation".
• Better performance (10-100x).
• Cleaner programming model.
• Used HDInsight Spark (Azure) to avoid the operational difficulties of running our own Spark cluster.

Page 31: Processing Planetary Sized Datasets

Heatmap Spark Mapper

For each location, map it to a tile at every zoom level:

(36.9741, -122.0308) => [(10_398_164, 1), (11_797_329, 1), (12_1594_659, 1), (13_3189_1319, 1), (14_6378_2638, 1), (15_12757_5276, 1), (16_25514_10552, 1), (17_51028_21105, 1), (18_102057_42211, 1)]

Page 32: Processing Planetary Sized Datasets

Heatmap Spark Algorithm

Reduce all the mappings with the same key into an aggregate value:

[(10_398_164, 1), (10_398_164, 1), (10_398_164, 1), … (10_398_164, 1)] => (10_398_164, 151)

Page 33: Processing Planetary Sized Datasets

Heatmap Spark Mapper

def tile_id_mapper(location):
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'], location['longitude'],
        MIN_ZOOM_LEVEL, MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings

Page 34: Processing Planetary Sized Datasets

Heatmap Spark

Building the heatmap then boils down to this in Spark:

lines = sc.textFile('wasb://[email protected]/')
locations = lines.flatMap(json_loader)
heatmap = locations.flatMap(tile_id_mapper) \
                   .reduceByKey(lambda agg1, agg2: agg1 + agg2)
heatmap.saveAsTextFile('wasb://[email protected]/')
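
The driver assumes a json_loader helper that the deck doesn't show. A minimal sketch, assuming each input line is a single JSON-encoded location and that malformed lines should simply be dropped:

import json

def json_loader(line):
    # Returning a list lets flatMap silently skip lines that fail to parse.
    try:
        return [json.loads(line)]
    except ValueError:
        return []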

Page 35: Processing Planetary Sized Datasets

Spark Shuffle

Page 36: Processing Planetary Sized Datasets

Pattern 4: Incremental Ingestion

[Diagram] Trips flow from the Application API into Azure Event Hub (Kafka), through Azure Stream Analytics, and into Azure Table Storage (HBase), accumulating as hourly slices (2016-04-28 12:00 through 17:00).
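
A minimal sketch of the producing side of this pipeline (the application API publishing location events into Azure Event Hub), using the current azure-eventhub Python SDK as an illustration; the hub name and connection string are hypothetical placeholders, not the talk's actual code.

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str='<event-hub-connection-string>',   # hypothetical placeholder
    eventhub_name='locations'                   # hypothetical hub name
)

def publish_locations(samples):
    # Batch the samples so a chatty vehicle doesn't become one request per point.
    batch = producer.create_batch()
    for sample in samples:
        batch.add(EventData(json.dumps(sample)))
    producer.send_batch(batch)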

Page 37: Processing Planetary Sized Datasets

Pattern 5: Data Slice Processing

[Diagram] Only the newest hourly slice (2016-04-28 17:00) is processed into a heatmap partial; that partial is then merged with the existing heatmap to produce the new heatmap.
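
A minimal sketch of that merge in Spark, assuming the heatmaps are stored as 'tileId,count' text lines; the path constants and the parse helper are illustrative, not the talk's actual code.

def parse_tile_count(line):
    # Assumed storage format: one 'tileId,count' pair per line.
    tile_id, count = line.rsplit(',', 1)
    return (tile_id, int(count))

existing = sc.textFile(EXISTING_HEATMAP_PATH).map(parse_tile_count)
partial = sc.textFile(HOURLY_SLICE_HEATMAP_PATH).map(parse_tile_count)

# Union the existing heatmap with the new slice's partial and re-reduce by
# tile id, so only one hour of raw locations ever needs to be reprocessed.
new_heatmap = existing.union(partial).reduceByKey(lambda a, b: a + b)
new_heatmap.saveAsTextFile(NEW_HEATMAP_PATH)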

Page 50: Processing Planetary Sized Datasets

Lambda Architecture: Speed Layer

[Diagram] Geolocated tweets stream into an Intersection Service built on Azure Functions, which intersects each tweet with geographic features and emits summary updates such as "Bin Jawad +1, Benghazi +1, Libya +1" and "Sirte +1, Libya +1".
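
A minimal sketch of the intersection step as an Event Hub-triggered Azure Function; intersect_features and increment_feature_count are hypothetical helpers standing in for the real intersection logic and summary store.

import json
import azure.functions as func

def main(event: func.EventHubEvent):
    tweet = json.loads(event.get_body())
    # Find the geographic features (town, city, country) containing this tweet
    # and emit a +1 summary update for each, e.g. "Benghazi +1", "Libya +1".
    for feature in intersect_features(tweet['latitude'], tweet['longitude']):
        increment_feature_count(feature, 1)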

Page 51: Processing Planetary Sized Datasets

Displaying Heatmaps

Page 52: Processing Planetary Sized Datasets

Pattern 6: Precomputing Heatmaps

Page 53: Processing Planetary Sized Datasets

Pattern 6: Precomputing Data Views

[Diagram] The heatmap deltas from the newest slice (2016-04-28 17:00) are applied to the previous heatmap to produce new heatmap updates keyed by tile id (e.g. 6_25_31, 7_50_62 … 7_50_64, 8_100_124 … 8_100_126, 9_201_245 … 9_201_250), which the Application API then serves directly.
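
A minimal sketch of applying those deltas, assuming a simple key-value view of per-tile counts that the application API reads; the store shape and helper names are illustrative, not the talk's actual code.

def apply_deltas(tile_views, deltas):
    # deltas: iterable of (tile_id, change) pairs, e.g. ('9_201_249', 37).
    for tile_id, change in deltas:
        tile_views[tile_id] = tile_views.get(tile_id, 0) + change
    return tile_views

def heatmap_tile(tile_views, tile_id):
    # Display becomes a key lookup against the precomputed view,
    # not an aggregation over raw locations at request time.
    return tile_views.get(tile_id, 0)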

Page 54: Processing Planetary Sized Datasets

Pattern 8: Use binary encoding

JSON: 39 TB => Avro: 23 TB

Roughly 40% smaller (the Avro encoding is about 60% of the JSON size).
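
A minimal sketch of the encoding itself, using the fastavro package; the schema is an assumption based on the location table earlier in the deck, not the project's actual schema.

from fastavro import parse_schema, writer

schema = parse_schema({
    'name': 'Location',
    'type': 'record',
    'fields': [
        {'name': 'vehicleId', 'type': 'long'},
        {'name': 'tripId', 'type': 'long'},
        {'name': 'timestamp', 'type': 'long'},
        {'name': 'latitude', 'type': 'double'},
        {'name': 'longitude', 'type': 'double'},
        {'name': 'altitude', 'type': 'int'},
    ],
})

records = [{
    'vehicleId': 10152875766888406, 'tripId': 57169639,
    'timestamp': 1445377623000, 'latitude': 36.966819,
    'longitude': -122.012298, 'altitude': 1809,
}]

with open('locations.avro', 'wb') as out:
    # Field names travel once in the schema rather than in every record,
    # and numeric fields are stored in compact binary form.
    writer(out, schema, records)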

Page 55: Processing Planetary Sized Datasets

Open Source

• geotile: http://github.com/timfpark/geotile (XYZ tile math in C#, JavaScript, and Python)
• heatmap: http://github.com/timfpark/heatmap (Spark code for building heatmaps)
• tileIndex: http://github.com/timfpark/tileIndexPusher (Azure Function for pushing tile indexes)

Page 56: Processing Planetary Sized Datasets

Questions?

Tim Park @timpark