MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and Visualization Using...

Preview:

DESCRIPTION

 

Citation preview

Mongo Analytics – Learn aggregation by example

Exploratory Analytics and Visualization using Flight Data

www.jsonstudio.com

Analyzing Flight Data

• JSON data imported from CSV downloaded from:http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

• Will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com)

• Sample document for a flight:

{ "_id": { "$oid": "534205f61c479f6149a92709" }, "YEAR": 2013, "QUARTER": 1, "MONTH": 1, "DAY_OF_MONTH": 18, "DAY_OF_WEEK": 5, "FL_DATE": "2013-01-18”, "UNIQUE_CARRIER": "DL”, "AIRLINE_ID": 19790, "CARRIER": "DL", "TAIL_NUM": "N325US”, "FL_NUM": 1497, "ORIGIN_AIRPORT_ID": 14100, "ORIGIN_AIRPORT_SEQ_ID": 1410002, "ORIGIN_CITY_MARKET_ID": 34100, "ORIGIN": "PHL", "ORIGIN_CITY_NAME": "Philadelphia, PA", "ORIGIN_STATE_ABR": "PA”, "ORIGIN_STATE_FIPS": 42, "DEST_AIRPORT_ID": 13487, "DEST_AIRPORT_SEQ_ID": 1348702, "DEST_CITY_MARKET_ID": 31650, "DEST": "MSP", "DEST_CITY_NAME": "Minneapolis, MN", "DEST_STATE_ABR": "MN", "DEST_STATE_FIPS": 27, "DEST_STATE_NM": "Minnesota", "DEST_WAC": 63, "CRS_DEP_TIME": 805, "DEP_TIME": 758,

"DEP_DELAY": -7, "DEP_DELAY_NEW": 0, "DEP_DEL15": 0, "DEP_DELAY_GROUP": -1, "DEP_TIME_BLK": "0800-0859", "TAXI_OUT": 24, "WHEELS_OFF": 822, "WHEELS_ON": 958, "TAXI_IN": 4, "CRS_ARR_TIME": 1015, "ARR_TIME": 1002, "ARR_DELAY": -13, "ARR_DELAY_NEW": 0, "ARR_DEL15": 0, "ARR_DELAY_GROUP": -1, "ARR_TIME_BLK": "1000-1059", "CANCELLED": 0, "CANCELLATION_CODE": "",

"DIVERTED": 0, "CRS_ELAPSED_TIME": 190, "ACTUAL_ELAPSED_TIME": 184, "AIR_TIME": 156, "FLIGHTS": 1, "DISTANCE": 980, "DISTANCE_GROUP": 4, "CARRIER_DELAY": "", "WEATHER_DELAY": "", "NAS_DELAY": "", "SECURITY_DELAY": "", "LATE_AIRCRAFT_DELAY": "", "FIRST_DEP_TIME": "", "TOTAL_ADD_GTIME": "", "LONGEST_ADD_GTIME": "", "": "" }

MongoDB aggregation steps/stages

• Grouping• Matching/filtering• Projection• Sorting• Unwind• Limit, skip• Added in 2.6– Out– Redact

Who are the largest carriers?

Which airports have the most cancellations?

Which carriers are most at fault for cancellations?

Arrival delays by distance

Delays by distance by carrier

Delays by distance by carrier – long haul only

Order Does Matter

An example for $unwind

• DB: companies• Collection: companies• Pipeline: movers

Hub airports – try1

Hub airports – try2

Hub airports – try 3

{ $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } }, { $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } }, { $match: { "count": { $gte: "$$hub_threshold" } } }, { $group: {

_id: { airport: "$airport" }, airlines: { $sum: 1 }, flights: { $sum: "$count" }, avg_airline: { $avg: "$count" }, max_airline: { $max: "$count" } } },

{ $project: { "airlines": 1, "flights": 1, "avg_airline": 1, "max_airline": 1, "avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } },

{ $sort: { "flights": -1 } }

Hub airports

Visualizing route data (from/to & main routes)

{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN", DEST: "$DEST" }, count: { $sum: 1 } } },

{ $match: { "count": { $gt: "$$count_threshold" } } }

Hub visualization (using routes – from/to, $$count=1, origin treemap)

From-to Insensitive{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN",

DEST: "$DEST" }, count: { $sum: 1 } } }, { $match: { "count": { $gt: "$$count_threshold" } } },

{ $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1,

rroute: { $cond: [

{ $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] }, { $concat: [ "$_id.ORIGIN", "$_id.DEST" ] }, { $concat: [ "$_id.DEST", "$_id.ORIGIN" ] }

] } } },

{ $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" }, _sum_count: { $sum: "$count" } } }

All routes treemap – airline fragmentation (rroute treemap)

New in 2.6 aggregation

• Cursors• MaxTimeMS• $out• $redact• $let/$map• Misc - $cond with either object or array, set

expression, ..

Quiz - NYC Airports

1. Import data – mongoimport (flights, airports, planes)2. Make join collection:

3. Query-Insert for all NYC data - ~500K flights per year originating in JFK/EWR/LGA

{ "Year": 2008, … "LateAircraftDelay": "NA", "to": { "city": "Detroit", "state": "MI", "lat": 42.21205889, "long": -83.34883583 }, "from": { "city": "Newark", "state": "NJ", "lat": 40.69249722, "long": -74.16866056 }, "aircraft": { "manufacturer": "EMBRAER", "model": "EMB-145EP", "year": 1997 }}

NYC Flights – Quiz Questions• Of the three airports, who has the most flights?

– Nyc1

• Who has the most cancellations and highest cancellation ratio?– Nyc2

• Taxi in/out times?– Nyc3

• What about delays?– Nyc4

• How do delays differ by month?– Nyc5 + nyc5– (summer vs. winter / bubble size vs. y-axis)

• What about weather delays only? Which months are worse? Are the three airports equivalent?– Nyc7 + nyc7

• Where can I fly to if I work for Boeing and am very loyal (and on which aicraft)?– Nyc8 + map

www.jsonstudio.com

Discount code: MUGAU

ron@jsonar.com

Recommended