Lecture 3: Document Databases – MongoDB – Aggregation, Map-Reduce, and Distributed Operation
Databases (3): NoSQL & Deductive Databases
Martin Homola, Ján Kľuka, Alexander Šimko, Jozef Šiška
Department of Applied Informatics
Faculty of Mathematics, Physics and Informatics
Comenius University in Bratislava
7 Oct 2021
Aggregation
Aggregation in MongoDB
- So far, we have only done CRUD operations.
- MongoDB can also perform aggregation operations:
    - Single-purpose: counting and collecting distinct values
    - Map-Reduce processing
    - Aggregation pipelines
- The data and the work of aggregation operations can be distributed over many nodes.
- This allows processing of "big data": data sets too large to fit into or be processed by one machine.
Aggregation Simple Aggregation
Simple Aggregation
- Counting:
    db.telefony.countDocuments( { "casti.cislo": { $gte: 190000 } } )
- Distinct values selection:
    db.telefony.distinct( "casti.predvolba",
                          { "casti.cislo": { $gte: 190000 } } )
- countDocuments and distinct are actually computed using the much more powerful aggregation pipelines.
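To illustrate the last point for counting: the MongoDB manual describes countDocuments as a wrapper around a two-stage pipeline. The sketch below only builds that pipeline as a plain JavaScript value. The variable names are ours, and the shell call is left as a comment because it assumes a running server with the telefony collection.

```javascript
// Filter taken from the counting example.
const query = { "casti.cislo": { $gte: 190000 } };

// Pipeline equivalent to countDocuments(query), as described in the manual:
// keep the matching documents, then put them all into a single group
// (_id: null) and add 1 for each of them.
const countPipeline = [
  { $match: query },
  { $group: { _id: null, n: { $sum: 1 } } }
];

// In the mongo shell (assumption: the collection exists):
//   db.telefony.aggregate(countPipeline)
```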
Aggregation Map-Reduce
Map-Reduce Aggregation
- Map-Reduce is a quite general framework for mass data processing.
- A collection is processed in two main stages, map and reduce, with a grouping step in between:
    1. A map JavaScript function takes each document and emit()s zero or more key-value pairs.
    2. MongoDB groups the emitted pairs by key.
    3. All values for one key are sent to a reduce JavaScript function to produce a single value.
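The three steps can be emulated in memory with plain JavaScript. This is a toy sketch, not how the server implements them: mapReduceSketch and the sample documents are hypothetical, and emit is passed to map as an argument here, whereas in MongoDB it is available as a global inside the map function.

```javascript
// Toy in-memory emulation of map, group, and reduce.
function mapReduceSketch(docs, map, reduce) {
  const groups = new Map();
  // 2. Group: every emitted value is collected under its key.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // 1. Map: call map with `this` bound to each document.
  for (const doc of docs) map.call(doc, emit);
  // 3. Reduce: one value per key. As in MongoDB, reduce is not
  //    called for a key that has only a single value.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return result;
}

// Hypothetical sample data: total population per region.
const docs = [
  { kraj: "A", pocet_obyvatelov: 10 },
  { kraj: "B", pocet_obyvatelov: 5 },
  { kraj: "A", pocet_obyvatelov: 7 }
];
const totals = mapReduceSketch(
  docs,
  function (emit) { emit(this.kraj, this.pocet_obyvatelov); },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
console.log(totals); // { A: 17, B: 5 }
```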
Map-Reduce Example
Reduce Requirements and Additional Tools
- The reduce function must:
    - produce a value of the same type as all input values;
    - be associative: reduce(k, [u, reduce(k, [v, w])]) = reduce(k, [u, v, w]);
    - be commutative: reduce(k, [u, v]) = reduce(k, [v, u]);
    - be idempotent: reduce(k, [reduce(k, vs)]) = reduce(k, vs).
- MongoDB also allows:
    - selection of documents by a query and a limit;
    - pre-sorting;
    - post-processing by a finalize JS function;
    - outputting the results as a new collection.
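These requirements are why an average cannot be computed directly in reduce. The check below is a sketch on hypothetical values: a reducer that sums (population, count) pairs satisfies all three laws, while a naive averaging reducer already fails associativity.

```javascript
// Reducer that sums both components of { po, pm } pairs.
const sumPairs = (key, values) => ({
  po: values.reduce((sum, v) => sum + v.po, 0),
  pm: values.reduce((sum, v) => sum + v.pm, 0)
});

// Naive reducer that averages plain numbers (counter-example).
const naiveAvg = (key, values) =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

const same = (a, b) => JSON.stringify(a) === JSON.stringify(b);
const [u, v, w] = [{ po: 10, pm: 1 }, { po: 20, pm: 1 }, { po: 60, pm: 1 }];

const associative =
  same(sumPairs("k", [u, sumPairs("k", [v, w])]), sumPairs("k", [u, v, w]));
const commutative =
  same(sumPairs("k", [u, v]), sumPairs("k", [v, u]));
const idempotent =
  same(sumPairs("k", [sumPairs("k", [u, v, w])]), sumPairs("k", [u, v, w]));
console.log(associative, commutative, idempotent); // true true true

// Averaging partial results gives a different answer than averaging at once.
const naiveNested = naiveAvg("k", [10, naiveAvg("k", [20, 60])]); // 25
const naiveFlat = naiveAvg("k", [10, 20, 60]);                    // 30
```

Since the server may call reduce several times for one key, on partial results, only the first kind of reducer gives a result that does not depend on how the values were split.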
mapReduce() Example – Average
Let's compute, for each region, the average number of inhabitants of its towns:

db.sidla.mapReduce(
    // map: emit the town's population and a count of 1 under its region
    function() {
        emit(this.kraj, { po: this.pocet_obyvatelov, pm: 1 })
    },
    // reduce: sum populations and counts separately
    function(key, values) {
        return {
            po: values.reduce( (sum, val) => sum + val.po, 0 ),
            pm: values.reduce( (sum, val) => sum + val.pm, 0 )
        }
    },
    {
        query: { druh: /mesto/ },
        // finalize: turn the two sums into an average
        finalize: function(key, reducedVal) {
            return reducedVal.po / reducedVal.pm
        },
        out: 'kraje'
    }
)
Aggregation Aggregation Pipeline
Aggregation Pipeline – aggregate()
- aggregate() processes a collection using a multi-stage pipeline
- Similar to Unix shell pipelines
- Less flexible than Map-Reduce, but without slow JavaScript
- Declarative pipeline description, so it can be optimized by the server
- Can also process change streams
Aggregation Pipeline example
aggregate() Example – Match, Group, and Sort
Let’s sort regions by the average number of inhabitants of their towns:
db.sidla.aggregate( [
    { $match: { druh: /mesto/ } },
    { $group: {
        _id: "$kraj",
        ppo: { $avg: "$pocet_obyvatelov" }
    } },
    { $sort: { ppo: 1 } }
] )
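To see what each of the three stages contributes, here is a plain-JavaScript emulation over a small in-memory array. The sample documents are hypothetical, and the code only mirrors what the stages compute, not how the server executes them.

```javascript
// Hypothetical sample of the sidla collection.
const sidla = [
  { nazov: "T1", kraj: "A", druh: "mesto", pocet_obyvatelov: 100 },
  { nazov: "T2", kraj: "A", druh: "mesto", pocet_obyvatelov: 300 },
  { nazov: "V1", kraj: "A", druh: "obec",  pocet_obyvatelov: 50 },
  { nazov: "T3", kraj: "B", druh: "mesto", pocet_obyvatelov: 80 }
];

// $match: keep only documents whose druh matches /mesto/.
const matched = sidla.filter(d => /mesto/.test(d.druh));

// $group with $avg: accumulate sum and count per region, then divide.
const acc = new Map();
for (const d of matched) {
  const g = acc.get(d.kraj) || { sum: 0, count: 0 };
  g.sum += d.pocet_obyvatelov;
  g.count += 1;
  acc.set(d.kraj, g);
}
const grouped = [...acc].map(([kraj, g]) => ({ _id: kraj, ppo: g.sum / g.count }));

// $sort: ascending by ppo.
const sorted = grouped.sort((a, b) => a.ppo - b.ppo);
console.log(sorted); // region B (ppo 80) comes before region A (ppo 200)
```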
Joins with $lookup
$lookup adds to every document an array of all documents from another collection with a matching value of some field (roughly a left outer join):

db.sidla.aggregate( [
    { $match: { druh: /mesto/ } },
    { $lookup: {
        from: "kraje",
        localField: "kraj",
        foreignField: "nazov",
        as: "dataKraja"
    } },
    { $project: {
        nazov: 1,
        podielPoctuObyvatelov: {
            $divide: [
                "$pocet_obyvatelov",
                { $arrayElemAt: [ "$dataKraja.pocet_obyvatelov", 0 ] }
            ]
        }
    } }
] )
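The join semantics can be sketched in plain JavaScript. lookupSketch and the sample data are hypothetical, and the sketch covers only simple equality on scalar fields; the real stage has additional rules (for example for missing or array-valued fields) that are not modelled here.

```javascript
// Toy emulation of $lookup: every input document is kept and gets an
// array of the matching documents from the other collection.
function lookupSketch(docs, from, localField, foreignField, as) {
  return docs.map(doc => ({
    ...doc,
    [as]: from.filter(other => other[foreignField] === doc[localField])
  }));
}

// Hypothetical sample data.
const kraje = [{ nazov: "A", pocet_obyvatelov: 1000 }];
const mesta = [
  { nazov: "T1", kraj: "A", pocet_obyvatelov: 250 },
  { nazov: "T2", kraj: "X", pocet_obyvatelov: 40 }
];

const joined = lookupSketch(mesta, kraje, "kraj", "nazov", "dataKraja");
// T1 gets one matching region. T2 has no match but is still kept,
// with an empty array: this is the "left outer" part.

// The $project step for T1: the town's share of its region's population.
const podiel = joined[0].pocet_obyvatelov / joined[0].dataKraja[0].pocet_obyvatelov;
console.log(podiel); // 0.25
```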
Aggregation Pipeline Stages
Pipeline stages can:
- select documents ($match), limit, skip;
- group by a field or expression ($group, $bucket) and in each group accumulate the values of other fields (sum, min, max, avg, collect to an array, ...);
- join documents with documents from a collection ($lookup), also recursively ($graphLookup);
- explode an array field into one document for each element ($unwind);
- project, set, unset, and compute fields;
- sort documents;
- merge with an existing collection or output to a new collection;
- ...
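Of the stages listed, $unwind is probably the least obvious, so here is a plain-JavaScript sketch of its default behaviour. unwindSketch and the sample documents are hypothetical, and options such as preserveNullAndEmptyArrays are not modelled.

```javascript
// Toy emulation of $unwind: one output document per array element.
function unwindSketch(docs, field) {
  return docs.flatMap(doc => {
    const value = doc[field];
    // Missing or null fields produce no output by default.
    if (value === undefined || value === null) return [];
    // A non-array value is treated like a single-element array.
    const elements = Array.isArray(value) ? value : [value];
    // An empty array also produces no output.
    return elements.map(element => ({ ...doc, [field]: element }));
  });
}

const docs = [
  { _id: 1, tags: ["a", "b"] },
  { _id: 2, tags: [] },
  { _id: 3, tags: "c" }
];
const unwound = unwindSketch(docs, "tags");
console.log(unwound.length); // 3: two documents for _id 1, none for _id 2, one for _id 3
```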
Distribution
Distributed Operation
- MongoDB can distribute big collections over multiple server nodes.
- A collection can be partitioned into shards: distribution of storage and processing power.
- Unsharded collections and shards can be replicated in replica sets: safety, distribution of processing power for reads.
- Multi-document transactions are available only in replica sets (but a single "replica", i.e., just the original, is enough).
- Aggregation pipelines and map-reduce run distributed over shards.
Replication
Replica Set Creation
$ mongod --replSet "rs0" --bind_ip localhost,mongodb0.example.net
...
$ mongo
...
> rs.initiate( {
    _id : "rs0",
    members: [
        { _id: 0, host: "mongodb0.example.net:27017" },
        { _id: 1, host: "mongodb1.example.net:27017" },
        { _id: 2, host: "mongodb2.example.net:27017" }
    ]
} )
Sharding
Enabling Sharding
Sharding config server, shard servers, and routers must be started
sh.addShard("mongo1.example.net:27017")
sh.addShard("mongo2.example.net:27017")
sh.enableSharding("prednaska")
// Shard a collection by a key
sh.shardCollection("prednaska.telefony",
    { "casti.predvolba": 1 } )
// or
sh.shardCollection("prednaska.telefony",
    { formatovane: "hashed" } )
// or
sh.shardCollection("prednaska.telefony",
    { "casti.predvolba": 1, "casti.cislo": 1 } )
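The difference between a ranged key (value 1) and a "hashed" key can be illustrated with a toy model. Everything here is an assumption made for illustration: the two shard names, the split point, and the hash function are ours. MongoDB uses its own hash and assigns ranges of hash values to chunks rather than taking a modulo.

```javascript
// Toy model with two hypothetical shards.
const SHARDS = ["shardA", "shardB"];

// Ranged placement: keys below the split point go to the first shard,
// so neighbouring key values end up together.
const rangedShard = (key, splitPoint) =>
  key < splitPoint ? SHARDS[0] : SHARDS[1];

// Hashed placement: a toy string hash (NOT MongoDB's) picks the shard,
// so neighbouring key values can end up on different shards.
function toyHash(text) {
  let h = 0;
  for (const ch of String(text)) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h;
}
const hashedShard = key => SHARDS[toyHash(key) % SHARDS.length];

// Hypothetical dialling prefixes, compared as strings.
const prefixes = ["02", "031", "032", "055"];
const ranged = prefixes.map(p => rangedShard(p, "04"));
console.log(ranged); // [ 'shardA', 'shardA', 'shardA', 'shardB' ]
const hashed = prefixes.map(hashedShard);
```

Ranged keys keep range queries local to few shards but can load shards unevenly. Hashed keys spread writes more evenly at the cost of scattering range queries.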