The Best IoT Analytics with MongoDB
Jake Angerman
Sr. Solutions Architect, MongoDB
Track Overview
Sessions:
1. Building an IoT Application that Will Work Next Year ✔
2. Building IoT Applications the Right Way ✔
3. The Best IoT Analytics with MongoDB
Introduction
#MDBW16
Morpheus: time series data is everywhere
Morpheus picture
Automatic Dependent Surveillance Broadcast (ADS-B)
Primary radar
Secondary Surveillance Radar
Software defined radio
1090 MHz
1030 MHz
1090 MHz
Tin Can Reveal
homemade antenna (6.9cm quarter-wave whip)
NooElec NESDR Mini 2 SDR $23.00
USB extension cable $10.00
RF cable RG316 female to MCX male $5.50
tin can $2.87
Total: $41.37
6.9cm antenna
USB SDR
dump1090
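The 6.9cm antenna length follows from the 1090 MHz ADS-B frequency. A minimal sketch of the arithmetic, assuming a free-space quarter wavelength with no velocity-factor correction (the function name is made up for this illustration):

```python
SPEED_OF_LIGHT_M_S = 299_792_458  # metres per second

def quarter_wave_cm(frequency_hz):
    """Return a quarter of the free-space wavelength, in centimetres."""
    wavelength_m = SPEED_OF_LIGHT_M_S / frequency_hz
    return wavelength_m / 4 * 100

# ADS-B is broadcast on 1090 MHz
print(round(quarter_wave_cm(1090e6), 1))  # 6.9
```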
dump1090
Antenna Range approximately 250 miles (400km)
> db.tincan.aggregate([
    { $geoNear: {
        near: { type: "Point", coordinates: [ center_y, center_x ] },
        distanceField: "meters",
        minDistance: 394289,
        limit: 100,
        spherical: true
    }},
    { $sort: { "meters": -1 } },
    { $limit: 1 }
  ])
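To sanity-check a range figure like this outside the database, a great-circle distance can be computed by hand. The sketch below uses the haversine formula with a mean Earth radius; both the radius constant and the helper name are assumptions for illustration, and the result may differ slightly from what `$geoNear` reports.

```python
import math

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius in metres (assumed value)
METERS_PER_MILE = 1609.344

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# The minDistance used in the query, converted to miles
print(round(394289 / METERS_PER_MILE))  # 245
```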
ADS-B BaseStation data format
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.875,2016/03/28,21:42:25.865,,36975,,,,,,,,,,0
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.884,2016/03/28,21:42:25.865,,36975,,,,,,,,,,0
MSG,8,111,11111,A33AA7,111111,2016/03/28,21:42:25.898,2016/03/28,21:42:25.865,,,,,,,,,,,,0
MSG,5,111,11111,A33AA7,111111,2016/03/28,21:42:25.961,2016/03/28,21:42:25.931,,28225,,,,,,,0,,0,0
MSG,3,111,11111,A678EF,111111,2016/03/28,21:42:26.013,2016/03/28,21:42:25.996,,34000,,,30.58369,-98.75438,,,,,,0
MSG,4,111,11111,A678EF,111111,2016/03/28,21:42:26.013,2016/03/28,21:42:25.996,,,417,283,,,0,,,,,0
MSG,3,111,11111,0D081C,111111,2016/03/28,21:42:26.280,2016/03/28,21:42:26.258,,35975,,,29.86456,-98.24018,,,,,,0
MSG,4,111,11111,0D081C,111111,2016/03/28,21:42:26.280,2016/03/28,21:42:26.258,,,429,206,,,0,,,,,0
MSG,8,111,11111,0D0648,111111,2016/03/28,21:42:26.358,2016/03/28,21:42:26.324,,,,,,,,,,,,0
MSG,3,111,11111,A678EF,111111,2016/03/28,21:42:26.454,2016/03/28,21:42:26.390,,34000,,,30.58389,-98.75544,,,,,,0
MSG,8,111,11111,A33AA7,111111,2016/03/28,21:42:26.478,2016/03/28,21:42:26.455,,,,,,,,,,,,0
MSG,7,111,11111,A678EF,111111,2016/03/28,21:42:26.679,2016/03/28,21:42:26.651,,34000,,,,,,,,,,0
MSG,7,111,11111,0D081C,111111,2016/03/28,21:42:26.759,2016/03/28,21:42:26.717,,35975,,,,,,,,,,0
Highlighted fields: message type, ICAO hex, date & time stamp, altitude, speed, lat/long
ADS-B in JSON
{
  "timestamp" : ISODate("2016-01-31T20:54:35.000+0000"),
  "icao" : "AC4144",
  "callsign" : "N889WM",
  "altitude" : 9350,
  "bearing" : 150,
  "position" : [-98.62762, 30.03657],
  "ground_speed" : 152,
  "vertical_rate" : 192
}
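One way a BaseStation line could be turned into a document of this shape is sketched below. The field positions are an assumption based on the sample messages and the commonly documented 22-field BaseStation layout; `parse_basestation` is a hypothetical helper, not the deck's ingest code.

```python
from datetime import datetime

def parse_basestation(line):
    """Parse one BaseStation 'MSG' line into a dict, or return None if malformed."""
    f = line.strip().split(",")
    if len(f) < 22 or f[0] != "MSG":
        return None
    doc = {
        "msg_type": int(f[1]),
        "icao": f[4],
        # fields 6 and 7: date and time the message was generated
        "timestamp": datetime.strptime(f[6] + " " + f[7], "%Y/%m/%d %H:%M:%S.%f"),
    }
    # Most fields are empty in any given message type, so only set what is present.
    if f[10].strip():
        doc["callsign"] = f[10].strip()
    if f[11]:
        doc["altitude"] = int(float(f[11]))
    if f[12]:
        doc["ground_speed"] = int(float(f[12]))
    if f[13]:
        doc["bearing"] = int(float(f[13]))
    if f[14] and f[15]:
        # stored as [longitude, latitude], matching the JSON example
        doc["position"] = [float(f[15]), float(f[14])]
    if f[16]:
        doc["vertical_rate"] = int(float(f[16]))
    return doc
```

Message types 3 and 4 carry different subsets of fields, so a consumer would typically merge several messages per aircraft, or store each one as its own sparse document.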
dump1090 data flow
dump1090 (linked list in RAM)
HTTP :8080 -> AJAX JSON
[{"hex":"ac741c", "squawk":"6234", "flight":"AAL2417 ", "lat":30.619176, "lon":-97.755963, "validposition":1, "altitude":35975, "vert_rate":0, "track":202, "validtrack":1, "speed":438, "messages":557, "seen":0}]
BaseStation TCP :30003 -> ingest.py -> MongoDB (TCP)
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.875,2016/03/28,21:42:25.865,,36975
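The ingest step amounts to reading newline-delimited messages from a TCP stream and handing each one to the database. A minimal sketch of that loop, using only the standard library: `read_lines`, `ingest`, and the `insert_one` callback are names invented here, and the real ingest.py may be structured differently.

```python
import socket

def read_lines(sock, bufsize=4096):
    """Yield complete, non-empty text lines from a stream socket until it closes."""
    pending = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break  # peer closed the connection; any partial trailing line is dropped
        pending += chunk
        *lines, pending = pending.split(b"\n")
        for raw in lines:
            line = raw.decode("ascii", errors="replace").strip()
            if line:
                yield line

def ingest(sock, insert_one):
    """Pass every BaseStation message read from sock to insert_one; return the count."""
    count = 0
    for line in read_lines(sock):
        if line.startswith("MSG,"):
            insert_one({"raw": line})
            count += 1
    return count

# Hypothetical usage against a local dump1090 instance:
#   sock = socket.create_connection(("localhost", 30003))
#   ingest(sock, collection.insert_one)   # e.g. a pymongo collection
```

Taking the insert function as a parameter keeps the socket handling testable without a running database.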
What Types of Analytics Can We Do?
• Real-time dashboards (<1 second latency) = Aggregation framework
• Ad-hoc queries = Aggregation framework
• Historical Reports = Aggregation framework or BI Connector
• Batch processing = Hadoop
• Machine Learning = Spark
Analytics without Data Migration
[Diagram: Devices, Database, two ETL stages feeding separate DBs, Dashboards, Historical Analysis]
Analytics without Data Migration
[Diagram: Devices, Database, Dashboards, Historical Analysis]
• No bulk or incremental ETL required
• One language for both real-time and ad-hoc queries
Workload Isolation
[Diagram: replica set with one primary and two secondaries; Devices, Dashboards, Historical Analysis]
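One common way to get this kind of isolation is to point analytics clients at the secondaries through the `readPreference` connection-string option. The sketch below only builds such a URI; the host names, replica set name, and the `analytics_uri` helper are placeholders, and the deck does not say which mechanism it used.

```python
from urllib.parse import urlencode

def analytics_uri(hosts, replica_set, database):
    """Build a MongoDB URI that asks drivers to route reads to secondaries."""
    options = urlencode({"replicaSet": replica_set, "readPreference": "secondary"})
    return "mongodb://{}/{}?{}".format(",".join(hosts), database, options)

# Hypothetical three-member replica set
print(analytics_uri(["db1:27017", "db2:27017", "db3:27017"], "rs0", "adsb"))
# mongodb://db1:27017,db2:27017,db3:27017/adsb?replicaSet=rs0&readPreference=secondary
```

Devices would keep writing through a default connection to the primary, while dashboards or reports use the URI above.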
Aggregation Framework
dump1090 dashboard
dump1090 (linked list in RAM)
HTTP :8080 -> AJAX JSON
[{"hex":"ac741c", "squawk":"6234", "flight":"AAL2417 ", "lat":30.619176, "lon":-97.755963, "validposition":1, "altitude":35975, "vert_rate":0, "track":202, "validtrack":1, "speed":438, "messages":557, "seen":0}]
BaseStation TCP :30003 -> ingest.py -> MongoDB (TCP), WiredTiger (WT) cache
MSG,7,111,11111,A3DC34,111111,2016/03/28,21:42:25.875,2016/03/28,21:42:25.865,,36975
Real-time Dashboards
• Current Radar, last 5 minutes' worth of aircraft data
• $match first uses index
• pre-built array avoids clumsy looping in application
import datetime

pipeline = [
    # $match comes first so it can use the index on "t"
    {"$match": {"t": {"$gte": datetime.datetime.utcnow() - datetime.timedelta(minutes=5)}}},
    {"$sort": {"icao": 1, "t": 1}},
    # $push builds one time-ordered array of events per aircraft
    {"$group": {
        "_id": {"icao": "$icao"},
        "events": {"$push": {
            "flight": "$callsign",
            "altitude": "$a",
            "track": "$b",
            "speed": "$s",
            "lon": {"$arrayElemAt": ["$p", 0]},
            "lat": {"$arrayElemAt": ["$p", 1]},
            "vert_rate": "$v"
        }},
        "sum": {"$sum": 1}
    }},
    {"$project": {"_id": 0, "icao": "$_id.icao", "events": "$events", "sum": "$sum"}}
]
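For comparison, this is roughly the looping an application would need if the database returned raw observations instead of the pre-built `events` array. It is a plain-Python sketch over already-fetched documents, with the short field names (`a`, `b`, `s`, `p`, `v`, `t`) taken from the stored documents; `group_by_icao` is an illustrative name.

```python
from collections import defaultdict

def group_by_icao(observations):
    """Group observation dicts into one time-ordered event list per aircraft."""
    tracks = defaultdict(list)
    for obs in sorted(observations, key=lambda o: (o["icao"], o["t"])):
        lon, lat = obs.get("p", (None, None))
        tracks[obs["icao"]].append({
            "flight": obs.get("callsign"),
            "altitude": obs.get("a"),
            "track": obs.get("b"),
            "speed": obs.get("s"),
            "lon": lon,
            "lat": lat,
            "vert_rate": obs.get("v"),
        })
    return [
        {"icao": icao, "events": events, "sum": len(events)}
        for icao, events in tracks.items()
    ]
```

Doing this in the aggregation pipeline keeps the work next to the data and returns one document per aircraft, ready for the dashboard to draw.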
Ad hoc aggregations
Which aircraft has the most observations?
> db.tincan.aggregate([
    { $group: { _id: "$icao", "sum": { $sum: 1 }, "callsigns": { "$addToSet": "$callsign" } } },
    { $sort: { "sum": -1 } },
    { $limit: 1 }
  ])
Sample document:
{
  "_id": ObjectId("5755..."),
  "icao": "ADE201",
  "callsign": "N994FE",
  "a": 8600,
  "b": 104,
  "p": [-98.99888, 30.93031],
  "s": 164,
  "t": ISODate("2016-02-09T02:33:01Z")
}
Which aircraft has the most observations?
"result": [
  {
    "_id": "ADE201",
    "sum": 14373,
    "callsigns": [ "N994FE" ]
  }
]
ICAO aircraft collection
$ mongoimport -d adsb -c aircraft --type csv --headerline aircraft_db.csv
icao,regid,mdl,type,operator
000334,PU-PLS,ULAC,EDRA SUPER PETREL LS,PRIVATE OWNER
000D77,PU-VGA,WT9,WT-9 DYNAMIC,PRIVATE OWNER
000D82,PU-DCT,WT9,AEROSPOOL WT9 DYNAMIC,PRIVATE OWNER
001100,-,320,UNKNOWN / VARIOUS,CODE USED BY SEVERAL AIRCRAFT
001108,EJC-1108,AC90,GULFSTREAM 690D,EJERCITO DE COLOMBIA
001411,PU-BGC,RV9,AMATEUR VANS RV-9A,PRIVATE OWNER
002008,LV-S004,P208,TECNAM P-2008,PRIVATE OWNER
003106,PU-FUA,ULAC,AMATEUR GFLY,PRIVATE OWNER
004003,Z-WPB,B732,BOEING 737-2N0,AIR ZIMBABWE
...
$lookup to find aircraft model
> db.tincan.aggregate([
    { $group: { _id: "$icao", "sum": { $sum: 1 }, "callsigns": { "$addToSet": "$callsign" } } },
    { $sort: { "sum": -1 } },
    { $limit: 1 },
    { $lookup: { from: "aircraft", localField: "_id", foreignField: "icao", as: "description" } }
  ])
$lookup to find aircraft model
"result": [
  {
    "_id": "ADE201",
    "sum": 14373,
    "callsigns": [ "N994FE" ],
    "description": [
      {
        "_id": ObjectId("575074300cf625050f2e730e"),
        "icao": "ADE201",
        "regid": "N994FE",
        "mdl": "C208",
        "type": "CESSNA 208B GRAND CARAVAN"
      }
    ]
  }
]
FedEx
Which aircraft is seen on the greatest number of days?
> db.tincan.aggregate([
    { $group: { _id: { icao: "$icao", dayOfYear: { $dateToString: { format: "%Y%m%d", date: "$t" } } } } },
    { $group: { _id: "$_id.icao", sum: { $sum: 1 } } },
    { $sort: { "sum": -1 } },
    { $limit: 1 },
    { $lookup: { from: "aircraft", localField: "_id", foreignField: "icao", as: "description" } }
  ])
Which aircraft is seen on the greatest number of days?
"result": [
  {
    "_id": "A35969",
    "sum": 63,
    "description": [
      {
        "_id": ObjectId("5762e9cf6ecfc147a0503894"),
        "icao": "A35969",
        "regid": "N315AE",
        "mdl": "B06",
        "type": "BELL 206L-1 LONGRANGER II",
        "operator": "AIR EVAC EMS"
      }
    ]
  }
]
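The two `$group` stages implement a count of distinct days: the first collapses all observations of an aircraft on one day into a single (aircraft, day) pair, and the second counts the pairs per aircraft. The same idea in plain Python, as an illustrative sketch over in-memory documents (`days_seen` is a made-up name):

```python
from collections import Counter
from datetime import datetime

def days_seen(observations):
    """Count the number of distinct calendar days on which each aircraft appears."""
    # Stage 1: one entry per (aircraft, day), however many observations that day had.
    distinct = {(o["icao"], o["t"].strftime("%Y%m%d")) for o in observations}
    # Stage 2: count the surviving pairs per aircraft.
    return Counter(icao for icao, _ in distinct)

sample = [
    {"icao": "A35969", "t": datetime(2016, 2, 8, 5, 0)},
    {"icao": "A35969", "t": datetime(2016, 2, 8, 9, 30)},
    {"icao": "A35969", "t": datetime(2016, 2, 9, 7, 15)},
    {"icao": "ADE201", "t": datetime(2016, 2, 8, 6, 45)},
]
print(days_seen(sample).most_common(1))  # [('A35969', 2)]
```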
Business Intelligence Connector
BI Connector
• New in MongoDB 3.2 Enterprise Advanced
• Mapping and transformation layer
• Projects smaller parts of large data sets for reporting
BI Connector Data flow
[Diagram: Application data as documents, e.g. {name: “Andrew”, address: {street: …}}, in MongoDB (MongoDB Query Language) -> BI Connector with mapping metadata -> tables (SQL) -> Analytics & visualization]
FedEx N994FE Flight Paths
Observations per Operator
Altitude vs Speed
• Two predictable clusters:
  • turbine aircraft at cruising altitude
  • piston aircraft at lower altitude
• Outliers are Cessnas reporting 51,000+ ft
Spark
Spark Overview
• fast, general data processing engine
• interactive shell
• Scala, Java, Python
• machine learning libraries (MLlib)
• supports streaming
• HDFS not required
Spark Connector
Connector
BSON Files
MapReduce & HDFS
Spark Connector Diagram
• diagram
MongoDB Connector for Hadoop (with Spark Plug-in) https://github.com/mongodb/mongo-hadoop
MongoDB Connector for Spark https://github.com/mongodb/mongo-spark
Spark Machine Learning
Supervised
  Classification
  • Naive Bayes
  • Support Vector Machines
  • Random Decision Forests
  Regression
  • Linear
  • Logistic
Unsupervised
  Clustering
  • K-means
  Dimensionality Reduction
  • Principal Component Analysis
  • Singular Value Decomposition
K-Means Clustering
The K-Means algorithm aims to minimize the sum of squares of the distance between the points and the centroid of each cluster.
source: Lovro Iliassich, toptal.com
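To make the objective concrete, here is a small pure-Python sketch of Lloyd's algorithm together with the sum-of-squares measure it tries to reduce. It is for illustration only: it seeds the centroids with the first k points rather than at random, and it is not the Spark MLlib implementation used later in the deck.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=10):
    """Run a fixed number of Lloyd iterations and return the k centroids."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic seeding
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def wssse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(points, k=2)
print(centroids)                 # [[0.0, 0.5], [10.0, 10.5]]
print(wssse(points, centroids))  # 1.0
```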
K-Means Clustering
>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
OR specify a filter:
>>> input_conf = {
...     "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
...     "mongo.input.uri": "mongodb://localhost:27017/adsb.tincan",
...     "mongo.input.query": '{"t":{"$lte":{"$date":1455494400000}}}'
... }
>>> mongo_rdd = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, input_conf)
K-Means Clustering
>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
>>> mongo_rdd.first()
{u'icao': u'A06690', u'a': 11975, u'b': 150, u'_id': ObjectId('5755bb862355da56d87895cf'), u't': datetime.datetime(2016, 2, 8, 5, 25, 4), u'p': [-98.41437, 30.29066], u's': 285, u'v': -1152}
>>> parsed_rdd = mongo_rdd.map(parseData)
>>> parsed_rdd.first()
[5, 25, 4, 1, 11975, 150, 285, -1152, -98.14857, 30.92651]
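The slides do not show `parseData`. The sketch below is a guess at a function that would produce a vector of this shape: hour, minute, second, a day-of-week number, then altitude, bearing, speed, vertical rate, longitude, and latitude. Reading the fourth element as the ISO weekday is an assumption (2016-02-08 was a Monday, which gives 1), and `parse_data` is a hypothetical stand-in rather than the author's code.

```python
from datetime import datetime

def parse_data(doc):
    """Flatten one observation document into a numeric feature vector."""
    t = doc["t"]
    lon, lat = doc["p"]
    return [
        t.hour, t.minute, t.second,
        t.isoweekday(),  # assumed meaning of the fourth feature (Monday = 1)
        doc["a"], doc["b"], doc["s"], doc["v"],
        lon, lat,
    ]
```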
Choosing K
$$\mathrm{WSSSE} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
[Chart: Within Set Sum of Squared Error (WSSSE), 0 to 14,000,000, plotted against k from 0 to 200]
Standard Scaling
$$z = \frac{x - \mu}{\sigma}$$
>>> parsed_rdd.first()
[5, 25, 4, 1, 11975, 150, 285, -1152, -98.14857, 30.92651]
>>> scaled_features.first()
[-1.036, -1.1089, -0.2617, 0.6821, -0.8202, 0.4057, 0.8537, -1.6502, -0.6559, 0.6876]
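Standard scaling matters here because K-means uses Euclidean distance, so a feature measured in thousands (altitude) would otherwise swamp one measured in tens (hour, minute). A minimal pure-Python sketch of per-column scaling follows; it uses the sample standard deviation, which is an assumption about how the slide's values were produced, and `standard_scale` is an illustrative name.

```python
import statistics

def standard_scale(rows):
    """Scale each column of rows to zero mean and unit (sample) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [
        # a constant column has zero deviation; map it to 0.0 instead of dividing by zero
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]
```

After scaling, each feature contributes on a comparable footing to the distance computation.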
K-Means Clustering
>>> from pyspark.mllib.clustering import KMeans
>>> k = 10
>>> clusters = KMeans.train(parsed_rdd, k, maxIterations=10, runs=1, initializationMode="random")
>>> cluster_sizes = parsed_rdd.map(lambda e: clusters.predict(e)).countByValue()
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2: 118596, 3: 104609, 4: 254759, 5: 175840, 6: 166789, 7: 68309, 8: 147826, 9: 495102})
Save Results Back to MongoDB
def labelData(array):
    result = {}
    result['cluster'] = clusters.predict(array)
    result['daystamp'] = str(array[0])
    result['dayofweek'] = array[1]
    result['hour'] = array[2]
    result['minute'] = array[3]
    result['second'] = array[4]
    result['a'] = array[5]
    result['b'] = array[6]
    result['s'] = array[7]
    result['v'] = array[8]
    result['p'] = [ array[9], array[10] ]
    return result

>>> labeled_rdd = parsed_rdd.map(labelData)
>>> labeled_rdd.saveToMongoDB('mongodb://localhost:27017/adsb.labeled')
K-Means Clustering
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2: 118596, 3: 104609, 4: 254759, 5: 175840, 6: 166789, 7: 68309, 8: 147826, 9: 495102})
Hypothesis: largest cluster #9 is cruising altitude
Hypothesis: largest cluster #9 is cruising altitude
adsb> db.labeled.aggregate([
    { $match: { cluster: 9 } },
    { $group: { _id: "summary", "avg_alt": { $avg: "$a" }, "min_alt": { $min: "$a" }, "max_alt": { $max: "$a" } } }
  ])
Hypothesis: largest cluster #9 is cruising altitude
"result": [
  {
    "_id": "summary",
    "avg_alt": 33630,
    "min_alt": 30675,
    "max_alt": 35825
  }
]
Anomaly Detection
Anomaly!
• Plane appears at 12,000 ft out of nowhere
planefinder.net video
Don't Worry, He's OK
• 4 days later…
Summary
MongoDB
Machine Learning
Devices
Historical Reporting
Real-time Dashboard
https://github.com/kerneljake/adsb
Market Size: $36 Billion
Partners: 1,000+
International Offices: 15
Global Employees: 575+
Downloads Worldwide: 15,000,000+
Make a GIANT Impact www.mongodb.com/careers