MongoDB - Aggregation Pipeline

Jason Terpko, DBA @ Rackspace/ObjectRocket
linkedin.com/in/jterpko

Background

www.objectrocket.com 2

Overview

o  Aggregation Framework

o  Pipeline Stages

o  Operators

o  Performance

o  New Features

Aggregation Pipeline

o  Overview

o  Stages

o  Operators

o  Multiple Stage Example

What is the Aggregation Pipeline?

A framework for data visualization and/or manipulation using one or more stages executed in order (i.e. a pipeline).

•  Framework - Allows data to be transformed through stages; the result can be an array, a cursor, or even a collection

•  Visualization - Data transformation is not always required; the framework can also be used for basic counts, summations, and grouping

•  Manipulation - Documents can be transformed as they pass through each stage, which prepares the data for the next stage or the final result set

•  Output - The result can be iterated over using a cursor or saved to a collection within the same database

•  Expandable - New stages and operators are added with each major version, and in 3.4 views leverage the aggregation framework

All Stages

$collStats $project $match $redact $limit $skip

$unwind $group $sample

$sort $geoNear

$lookup $out

$indexStats $facet

$bucket $bucketAuto

$sortByCount $addFields

$replaceRoot $count

$graphLookup

Common Stages

$match    -  Filter (reduce) the number of documents that are passed to the next stage

$group    -  Group documents by a distinct key; the key can also be a compound key

$project  -  Pass documents with specific fields or newly computed fields to the next stage

$sort     -  Return the input documents in sorted order

$limit    -  Limit the number of documents passed to the next stage

$unwind   -  Split an array into one document for each element in the array

$out      -  As the last stage, create or replace an unsharded collection with the input documents
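
To make two of these descriptions concrete, the sketch below imitates $unwind and a $group on a compound key in plain JavaScript, with no MongoDB connection. The `unwind` and `groupCount` helpers and the sample documents are hypothetical and only approximate what the server does; the sketch assumes the unwound field is an array or missing.

```javascript
// Plain-JavaScript imitation of two stages; not MongoDB's implementation.
// The helper names and sample documents are made up for illustration.

// $unwind: emit one copy of the document per element of the array field.
// Assumes the field is an array or missing; empty arrays produce no output.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((element) => ({ ...doc, [field]: element }))
  );
}

// $group with a compound key: count documents per distinct key combination.
function groupCount(docs, keyFields) {
  const groups = new Map();
  for (const doc of docs) {
    const id = Object.fromEntries(keyFields.map((f) => [f, doc[f]]));
    const mapKey = JSON.stringify(id);
    const group = groups.get(mapKey) || { _id: id, count: 0 };
    group.count += 1;
    groups.set(mapKey, group);
  }
  return [...groups.values()];
}

const jobs = [
  { cluster: "a", type: "import", tags: ["x", "y"] },
  { cluster: "a", type: "import", tags: ["x"] },
  { cluster: "b", type: "export", tags: [] },
];

const unwound = unwind(jobs, "tags");
const grouped = groupCount(jobs, ["cluster", "type"]);
console.log(unwound.length, grouped);
```

The compound key is kept as an object in `_id`, which mirrors the shape `{_id: {cluster: ..., type: ...}}` used by the $group examples later in the deck.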

Common Operators

Operators that return a value based on document data:

Group Operators:       $sum $avg $max $min $first $last

Date Operators:        $year $month $week $hour $minute $second

Arithmetic Operators:  $abs $add $multiply $subtract $trunc

Operators that return true or false based on document data:

Comparison Operators:  $eq $gt $lt $gte $lte

Boolean Operators:     $and $or

Aggregate()

db.changelog.aggregate([
  {$match: {"details.note": "success", "details.step 6 of 6": {$gte: 0}}},
  {$sort: {time: -1}},
  {$limit: 100},
  {$project: {
    totalTime: {$add: [
      "$details.step 1 of 6", "$details.step 2 of 6",
      "$details.step 3 of 6", "$details.step 4 of 6",
      "$details.step 5 of 6", "$details.step 6 of 6"
    ]}
  }},
  {$group: {_id: null, averageTotalTime: {$avg: "$totalTime"}}}
]);

Collection: changelog

Purpose: Return the average number of milliseconds to move a chunk for the last one hundred moves.

$match

Stage 1

  {$match: {"details.note": "success", "details.step 6 of 6": {$gte: 0}}}

Purpose: In the first stage, keep only the chunks that moved successfully.

Comparison Operator: $gte

$sort

Stage 2

  {$sort: {time: -1}}

Purpose: Sort by time in descending order so the most recently moved chunks come first.

$limit

Stage 3

  {$limit: 100}

Purpose: Further reduce the number of moves being analyzed, because the time to move a chunk varies by chunk and collection.

$project

Stage 4

  {$project: {
    totalTime: {$add: [
      "$details.step 1 of 6", "$details.step 2 of 6",
      "$details.step 3 of 6", "$details.step 4 of 6",
      "$details.step 5 of 6", "$details.step 6 of 6"
    ]}
  }}

Purpose: For each moveChunk document, project the sum of the step times to the next stage.

Arithmetic Operator: $add

$group

Stage 5

  {$group: {_id: null, averageTotalTime: {$avg: "$totalTime"}}}

Purpose: Return the average number of milliseconds to move a chunk for the last one hundred moves.

Group Operator: $avg
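
As a rough picture of what the five stages do to the data, the sketch below runs the same steps over an in-memory array in plain JavaScript. It is not how the server executes the pipeline. The sample documents are invented, they record two steps instead of six, and the limit is 2 rather than 100 so the effect of each stage is easy to follow.

```javascript
// Plain-JavaScript walk-through of the five stages; the sample documents
// are invented and have two steps instead of six to keep the sketch short.
const changelog = [
  { time: 1, details: { note: "success", "step 1 of 2": 10, "step 2 of 2": 30 } },
  { time: 2, details: { note: "aborted", "step 1 of 2": 5 } },
  { time: 3, details: { note: "success", "step 1 of 2": 20, "step 2 of 2": 60 } },
  { time: 4, details: { note: "success", "step 1 of 2": 50, "step 2 of 2": 10 } },
];

const LIMIT = 2; // the slide uses 100

const totals = changelog
  // $match: successful moves that recorded the final step
  .filter((d) => d.details.note === "success" && d.details["step 2 of 2"] >= 0)
  // $sort: most recent first
  .sort((a, b) => b.time - a.time)
  // $limit: keep only the newest moves
  .slice(0, LIMIT)
  // $project: total time per move
  .map((d) => ({ totalTime: d.details["step 1 of 2"] + d.details["step 2 of 2"] }));

// $group with _id: null: a single average over everything that is left
const averageTotalTime =
  totals.reduce((sum, d) => sum + d.totalTime, 0) / totals.length;

console.log({ _id: null, averageTotalTime });
```

The aborted move is dropped by the filter and the oldest successful move is dropped by the limit, so only the two most recent successful moves contribute to the average.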

Optimizations

o  Projections

o  Sequencing

o  Indexing

o  Sorting

Projections

With a $project stage, Mongo reads and passes less data to the next stage, which reduces the CPU, RAM, and disk I/O needed to process the aggregation.

db.jobs.aggregate([
  {$match: {"type": "import"}},
  {$sort: {"cluster": 1}},
  {$project: {cluster: 1, type: 1, seconds: 1, _id: 0}},   // Stage 3
  {$group: {_id: {cluster: "$cluster", type: "$type"}, avgExecTime: {$avg: "$seconds"}}}
]);

By default, Mongo tries to determine whether only a subset of fields is required; if so, it requests only those fields and optimizes the stage for you.

Sequencing

When stages can be ordered more efficiently, Mongo will reorder those stages for you to improve execution time.

db.jobs.aggregate([
  {$sort: {"cluster": 1}},
  {$match: {"type": "import"}},
  {$project: {cluster: 1, type: 1, seconds: 1, _id: 0}},
  {$group: {_id: {cluster: "$cluster", type: "$type"}, avgExecTime: {$avg: "$seconds"}}}
]);

By filtering documents first, the number of documents to be sorted is reduced.
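
The sketch below illustrates why this reordering is safe and cheaper, using plain JavaScript arrays rather than a real collection. The `jobs` array is invented sample data; the point is only that both orderings return the same documents while the filter-first ordering sorts fewer of them.

```javascript
// Plain-JavaScript illustration; the jobs array is invented sample data.
const jobs = [
  { cluster: "c", type: "import", seconds: 12 },
  { cluster: "a", type: "export", seconds: 7 },
  { cluster: "b", type: "import", seconds: 30 },
  { cluster: "a", type: "import", seconds: 18 },
  { cluster: "d", type: "export", seconds: 4 },
];

const byCluster = (x, y) => x.cluster.localeCompare(y.cluster);
const isImport = (doc) => doc.type === "import";

// Order as written on the slide: sort everything, then filter.
const sortThenMatch = [...jobs].sort(byCluster).filter(isImport);

// Reordered: filter first, so the sort only sees the matching documents.
const matched = jobs.filter(isImport);
const matchThenSort = [...matched].sort(byCluster);

// Number of documents sorted in each ordering.
console.log(jobs.length, matched.length);
// Both orderings produce the same result.
console.log(JSON.stringify(sortThenMatch) === JSON.stringify(matchThenSort));
```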

Sequencing

When stages can be ordered more efficiently, Mongo will reorder those stages for you to improve execution time.

db.jobs.aggregate([
  {$match: {"type": "import"}},
  {$sort: {"cluster": 1}},
  {$project: {cluster: 1, type: 1, seconds: 1}},
  {$group: {_id: {cluster: "$cluster", type: "$type"}, avgExecTime: {$avg: "$seconds"}}}
]);

In addition to sequence optimizations, Mongo can also coalesce stages. For example, a $match stage followed by another $match becomes a single stage. A full list of sequence and coalescence optimizations can be found under "Aggregation Pipeline Optimization" in the MongoDB manual.
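
Coalescing two $match stages amounts to combining their conditions with a logical AND. The plain-JavaScript sketch below shows the equivalence on an in-memory array; the predicates and sample data are invented, and it says nothing about how the server performs the rewrite internally.

```javascript
// Plain-JavaScript illustration of coalescing; predicates and data are invented.
const jobs = [
  { type: "import", seconds: 12 },
  { type: "import", seconds: 45 },
  { type: "export", seconds: 50 },
];

const isImport = (doc) => doc.type === "import";
const isSlow = (doc) => doc.seconds >= 30;

// Two consecutive $match stages: two passes over the documents.
const twoStages = jobs.filter(isImport).filter(isSlow);

// Coalesced into one stage whose condition is the AND of both.
const oneStage = jobs.filter((doc) => isImport(doc) && isSlow(doc));

console.log(twoStages, oneStage);
```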

Indexing and Data Merging

Only two stages can use indexes: the $match stage and the $sort stage. Starting in version 3.2, an index can cover an aggregation. As with find(), you can generate an explain plan for an aggregation to view a more detailed execution plan.

To use an index, these stages must be the first stages in the pipeline.

Also released in version 3.2 for aggregations:

•  Data that does not require the primary shard no longer has to be merged on the primary shard.

•  Aggregations that include the shard key in the $match stage and don’t require data from other shards can execute entirely on the target shard.

Memory

Each stage is limited to 100 MB of RAM; this is the most common restriction encountered when using the aggregation framework. To exceed the limit, use the allowDiskUse option, which allows stages like $sort to use temporary files.

db.jobs.aggregate(
  [
    {$match: {"type": "import"}},
    {$sort: {"cluster": 1}},
    {$project: {cluster: 1, type: 1, seconds: 1}},
    {$group: {_id: {cluster: "$cluster", type: "$type"}, avgExecTime: {$avg: "$seconds"}}}
  ],
  {allowDiskUse: true}
);

This option should be used with caution in production due to added resource consumption.

New In 3.4

o  Recursive Search

o  Faceted Search

o  Views

Recursive Search

Recursively search a collection using $graphLookup. This stage in the pipeline takes input from either the collection or a previous stage (e.g. $match).

{
  $graphLookup: {
    from: "users",
    startWith: "$connections",
    connectFromField: "connections",
    connectToField: "name",
    as: "connections"
  }
}

Considerations

•  This stage is limited to 100 MB of RAM, even with the allowDiskUse option

•  A maxDepth of zero is equivalent to $lookup

•  Collation must be consistent when involving multiple views

Recursive Search

Users Collection:

{ "_id" : 101, "name" : "John", "connections" : ["Jane", "David"] }
{ "_id" : 102, "name" : "David", "connections" : ["George"] }
{ "_id" : 103, "name" : "George", "connections" : ["Melissa"] }
{ "_id" : 104, "name" : "Jane", "connections" : ["Jen"] }
{ "_id" : 105, "name" : "Melissa", "connections" : ["Jason"] }
{ "_id" : 106, "name" : "Nick", "connections" : ["Derek"] }

Recursive Search

Aggregation:

db.users.aggregate([
  {$match: {"name": "John"}},
  {$graphLookup: {
    from: "users",
    startWith: "$connections",
    connectFromField: "connections",
    connectToField: "name",
    as: "connections"
  }},
  {$project: {"_id": 0, "name": 1, "known connections": "$connections.name"}}
]).pretty();

Result:

{
  "name": "John",
  "known connections": [ "Melissa", "George", "Jane", "David" ]
}
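
One way to picture the recursion is as a breadth-first search over the collection. The sketch below is a plain-JavaScript imitation run against a copy of the sample users, not MongoDB's implementation; the `graphLookup` helper and its parameter order are hypothetical. It finds the same four names as the result above, although the order may differ because the order of the output array is not something to rely on.

```javascript
// Plain-JavaScript imitation of the recursive lookup; not MongoDB's implementation.
// The users array mirrors the sample collection from the slides.
const users = [
  { _id: 101, name: "John", connections: ["Jane", "David"] },
  { _id: 102, name: "David", connections: ["George"] },
  { _id: 103, name: "George", connections: ["Melissa"] },
  { _id: 104, name: "Jane", connections: ["Jen"] },
  { _id: 105, name: "Melissa", connections: ["Jason"] },
  { _id: 106, name: "Nick", connections: ["Derek"] },
];

// Hypothetical helper: breadth-first search that follows connectFromField
// values to documents whose connectToField matches them.
function graphLookup(collection, startWith, connectFromField, connectToField) {
  const found = new Map(); // _id -> document, so each document appears once
  let frontier = [...startWith];
  while (frontier.length > 0) {
    const next = [];
    for (const doc of collection) {
      if (frontier.includes(doc[connectToField]) && !found.has(doc._id)) {
        found.set(doc._id, doc);
        next.push(...doc[connectFromField]);
      }
    }
    frontier = next;
  }
  return [...found.values()];
}

const john = users.find((u) => u.name === "John");
const known = graphLookup(users, john.connections, "connections", "name")
  .map((u) => u.name);

console.log({ name: john.name, "known connections": known });
```

Jen and Jason are named as connections but have no documents of their own, so the search stops there; Nick is never reached from John.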

Faceted Search

$facet allows you to process multiple pipelines within a single aggregation stage. Each sub-pipeline receives the same input documents, and the stage outputs a single document containing the results of every sub-pipeline.

Sample Data:

{ "_id" : 101, "name" : "Perf T", "price" : NumberDecimal("19.99"), "colors" : [ "red", "white" ], "sizes" : ["M", "L", "XL"] }
{ "_id" : 102, "name" : "Perf V-Neck", "price" : NumberDecimal("24.99"), "colors" : [ "white", "blue" ], "sizes" : ["M", "L", "XL"] }
{ "_id" : 103, "name" : "Perf Tank", "price" : NumberDecimal("14.99"), "colors" : [ "red", "blue" ], "sizes" : ["M", "L"] }
{ "_id" : 104, "name" : "Perf Hoodie", "price" : NumberDecimal("34.99"), "colors" : [ "blue" ], "sizes" : ["M", "L", "XL"] }
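
The idea of several sub-pipelines sharing one input can be sketched in plain JavaScript. The example below imitates two facets that each unwind an array field and count its values, using a copy of the sample data with prices left out. The `unwindAndSortByCount` helper is hypothetical, and the order of values with equal counts may differ from what the server returns.

```javascript
// Plain-JavaScript imitation of two counting facets; not MongoDB's implementation.
// The store array mirrors the sample data from the slides (prices omitted).
const store = [
  { _id: 101, name: "Perf T", colors: ["red", "white"], sizes: ["M", "L", "XL"] },
  { _id: 102, name: "Perf V-Neck", colors: ["white", "blue"], sizes: ["M", "L", "XL"] },
  { _id: 103, name: "Perf Tank", colors: ["red", "blue"], sizes: ["M", "L"] },
  { _id: 104, name: "Perf Hoodie", colors: ["blue"], sizes: ["M", "L", "XL"] },
];

// Hypothetical helper: $unwind on an array field followed by $sortByCount.
function unwindAndSortByCount(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    for (const value of doc[field]) {
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([value, count]) => ({ _id: value, count }))
    .sort((a, b) => b.count - a.count);
}

// Each facet runs independently over the same input documents,
// and the results are collected into one output document.
const result = {
  categorizedByColor: unwindAndSortByCount(store, "colors"),
  categorizedBySize: unwindAndSortByCount(store, "sizes"),
};

console.log(JSON.stringify(result, null, 2));
```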

Faceted Search

Command:

db.store.aggregate([
  {
    $facet: {
      "categorizedByColor": [
        {$unwind: "$colors"},
        {$sortByCount: "$colors"}
      ],
      "categorizedBySize": [
        {$unwind: "$sizes"},
        {$sortByCount: "$sizes"}
      ],
      "categorizedByPrice": [
        {$bucketAuto: {groupBy: "$price", buckets: 2}}
      ]
    }
  }
]).pretty()

Faceted Search

{
  "categorizedByColor": [
    { "_id": "blue", "count": 3 },
    { "_id": "white", "count": 2 },
    { "_id": "red", "count": 2 }
  ],
  "categorizedBySize": [
    { "_id": "L", "count": 4 },
    { "_id": "M", "count": 4 },
    { "_id": "XL", "count": 3 }
  ],
  "categorizedByPrice": [
    { "_id": { "min": NumberDecimal("14.99"), "max": NumberDecimal("24.99") }, "count": 2 },
    { "_id": { "min": NumberDecimal("24.99"), "max": NumberDecimal("34.99") }, "count": 2 }
  ]
}

Views

A view is a read-only object that can be queried like the underlying collection. It is created using an aggregation pipeline and can be used to transform data or limit access to data in another collection.

•  Computed on demand for each read operation

•  Use indexes from the underlying collection

•  Names are immutable; to rename a view, drop and recreate it

•  Can be created on sharded collections

•  Are listed as collections in getCollectionNames()

•  Allow more granular access control than RBAC alone

Example View

Documents:

{
  "_id": 101,
  "first_name": "John",
  "last_name": "Doe",
  "dept": "123",
  "role": "DBA",
  "expense_id": 1234,
  "amt": "35.00",
  "c_date": ISODate("2017-02-25T17:08:46.166Z")
}

Example View

Create View:

db.createView(
  "recentExpenses",
  "expenses",
  [
    {$match: {"c_date": {$gte: new Date(new Date() - 86400 * 1000)}}},
    {$sort: {"c_date": -1}},   // sort first, while c_date is still present
    {$project: {"_id": 1, "first_name": 1, "last_name": 1, "amt": 1}}
  ]
);

Example View

Collections:

> show collections
system.views
expenses
recentExpenses

Metadata:

> db.system.views.find()
{ "_id" : "mydb.recentExpenses", "viewOn" : "expenses", "pipeline" : [ { "$match ……

Usage:

> db.recentExpenses.find()
{ "_id" : 103, "first_name" : "John", "last_name" : "Doe", "amt" : "35.00" }
{ "_id" : 102, "first_name" : "Jane", "last_name" : "Smith", "amt" : "36.00" }
{ "_id" : 101, "first_name" : "Mike", "last_name" : "Adams", "amt" : "33.00" }

Questions?

We’re Hiring!

Looking to join a dynamic & innovative team?

Justine is here at Percona Live 2017. Ask to speak with her!

Reach out directly to our Recruiter at [email protected]

Thank you!

Address: 401 Congress Ave, Suite 1950, Austin, TX 78701
Support: 1-800-961-4454
Sales: 1-888-440-3242

www.objectrocket.com
