Ghislain Fourny

Big Data Fall 2019 - systems.ethz.ch

12. Document stores

pinkyone / 123RF Stock Photo

Ilya Akinshin / 123RF Stock Photo


From SQL to NoSQL

The structured stack

Bits
Text
Well-formed CSV
Relational schema
Queryable tables

Relational Model: Tables

A  B  C      D
a  1  alpha  foo
a  2  alpha  bar
a  3  beta   foo

"Everything is a table"

Relational integrity
Atomic integrity

Relational Model: Schemas

A       B        C        D
string  integer  char(3)  date

Atomic types assigned to each column

Relational integrity
Atomic integrity
Domain integrity

Relational Algebra

Project
Select
Join
Group
Sort

Relational Syntax: CSV

Physical view (syntax):

ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"

Logical view (data model):

ID  Last name  First name  Theory
1   Einstein   Albert      General, Special Relativity
2   Gödel      Kurt        "Incompleteness" Theorem
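The step from the physical view to the logical view can be sketched with Python's standard csv module, which applies the quoting rules used above (a comma inside a quoted field, a doubled quote for a literal quote). The names PHYSICAL and parse are illustrative only and not part of the slides.

```python
import csv
import io

# Physical view: the raw text, exactly as it would sit in a file.
PHYSICAL = (
    'ID,Last name,First name,Theory\n'
    '1,Einstein,Albert,"General, Special Relativity"\n'
    '2,Gödel,Kurt,"""Incompleteness"" Theorem"\n'
)

def parse(text):
    """Turn CSV text into a list of rows (logical view), one dict per record."""
    return list(csv.DictReader(io.StringIO(text)))

rows = parse(PHYSICAL)
for row in rows:
    print(row["ID"], row["Last name"], row["First name"], row["Theory"], sep=" | ")
```

The quotes and the embedded comma are part of the syntax only; after parsing, each Theory value is a plain string.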

Relational Language: SQL

persons

name         middle_initial  last_name  century  captain
varchar(30)  char(1)         text       integer  boolean
James        T               Kirk       23       TRUE
Beverly      C               Crusher    24       FALSE
Jean-Luc     NULL            Picard     24       TRUE
Kathryn      NULL            Janeway    24       TRUE

SELECT century AS c
FROM persons
GROUP BY century
HAVING COUNT(*) > 2

Result:

century
integer
24
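The query above can be tried out with Python's built-in sqlite3 module. This is only a sketch: SQLite is an assumption made for convenience (the slides do not name a database engine), and the booleans are stored as 1 and 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE persons (name VARCHAR(30), middle_initial CHAR(1), "
    "last_name TEXT, century INTEGER, captain BOOLEAN)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?, ?)",
    [
        ("James", "T", "Kirk", 23, 1),
        ("Beverly", "C", "Crusher", 24, 0),
        ("Jean-Luc", None, "Picard", 24, 1),   # None becomes SQL NULL
        ("Kathryn", None, "Janeway", 24, 1),
    ],
)

# Keep only the centuries that occur more than twice.
result = conn.execute(
    "SELECT century AS c FROM persons GROUP BY century HAVING COUNT(*) > 2"
).fetchall()
print(result)
```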

The stack

Storage
Encoding
Syntax
Data models
Validation
Processing
Indexing
Data stores
User interfaces
Querying

We already rebuilt this stack with tables:

HDFS
UTF-8
CSV
DataFrames
Spark
HBase
SQL

Now, can we rebuild this all with XML/JSON?

HDFS
UTF-8
XML, JSON
Trees
XML Schema
Spark
?
?
?

The semi-structured stack

Bits
Text
Well-formed XML/JSON
Valid XML/JSON
Queryable XML/JSON

Making trees fit into tables

Flat trees

{
  "foo": 1,
  "bar": "foo",
  "foobar": true,
  "a": "bar",
  "b": 3.14
}

foo  bar  foobar  a    b
1    foo  true    bar  3.14

Flat trees

<row>
  <foo>1</foo>
  <bar>foo</bar>
  <foobar>true</foobar>
  <a>foo</a>
  <b>3.14</b>
</row>

foo  bar  foobar  a    b
1    foo  true    foo  3.14

Collections of flat trees

<row>
  <foo>1</foo>
  <bar>foo</bar>
  <foobar>true</foobar>
  <a>foo</a>
  <b>3.14</b>
</row>
<row>
  <foo>2</foo>
  <bar>bar</bar>
  <foobar>false</foobar>
  <a>bar</a>
  <b>4.2</b>
</row>

foo  bar  foobar  a    b
1    foo  true    foo  3.14
2    bar  false   bar  4.2

Schemas: from SQL to NoSQL

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="row">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="xs:integer"/>
        <xs:element name="bar" type="xs:string"/>
        <xs:element name="foobar" type="xs:boolean"/>
        <xs:element name="a" type="xs:string"/>
        <xs:element name="b" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

foo      bar     foobar   a       b
integer  string  boolean  string  decimal

NoSQL

But with JSON and XML, we can have

Nestedness
and
Heterogeneity

Atomic integrity (First normal form)
Relational integrity
Domain integrity

Nested arrays

{
  "category": 1,
  "job": "mathematician",
  "name": [
    { "last": "Ramanujan", "first": "Srinivasa" },
    { "last": "Gödel", "first": "Kurt" }
  ]
}
{
  "category": 2,
  "job": "physicist",
  "name": [
    { "last": "Einstein", "first": "Albert" }
  ]
}
...

category  job
1         mathematician
2         physicist

category  name.last  name.first
1         Ramanujan  Srinivasa
1         Gödel      Kurt
2         Einstein   Albert
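Forcing nested arrays into tables, as in the two tables above, amounts to splitting each object into one parent row plus one child row per array element. A minimal sketch follows, assuming every object carries the three fields of the example; the function name normalize is made up for illustration.

```python
def normalize(objects):
    """Split nested objects into a parent table and a child table."""
    parents, children = [], []
    for obj in objects:
        parents.append({"category": obj["category"], "job": obj["job"]})
        for name in obj["name"]:
            # The parent key is repeated in each child row, like a foreign key.
            children.append({
                "category": obj["category"],
                "name.last": name["last"],
                "name.first": name["first"],
            })
    return parents, children

objects = [
    {"category": 1, "job": "mathematician",
     "name": [{"last": "Ramanujan", "first": "Srinivasa"},
              {"last": "Gödel", "first": "Kurt"}]},
    {"category": 2, "job": "physicist",
     "name": [{"last": "Einstein", "first": "Albert"}]},
]
parents, children = normalize(objects)
print(parents)
print(children)
```

Getting the original objects back requires a join on category, which is one face of the impedance mismatch between trees and tables.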

Heterogeneity

{
  "id": 1,
  "profession": "physicist",
  "last name": "Einstein"
}
{
  "id": 2,
  "profession": "engineer"
}
{
  "id": 3,
  "first name": "Kurt"
}

id  profession  last name  first name
1   physicist   Einstein   NULL
2   engineer    NULL       NULL
3   NULL        NULL       Kurt

Impedance mismatch

Document stores

Serg_v / 123RF Stock Photo

Scale up: millions to billions (TB, PB)

Documents

JSON:

{
  "foo": 1,
  "bar": "foo",
  "name": [
    { "last": "Einstein", "first": "Albert" },
    { "last": "Gödel", "first": "Kurt" }
  ]
}

XML:

<row>
  <foo>1</foo>
  <bar>foo</bar>
  <names>
    <name>
      <last>Einstein</last>
      <first>Albert</first>
    </name>
    <name>
      <last>Gödel</last>
      <first>Kurt</first>
    </name>
  </names>
</row>

Collection of trees

{
  "foo": 1,
  "bar": [ "foo", "bar" ],
  "foobar" : true,
  "a" : { "foo" : null, "b" : [ 3, 2 ] },
  "b" : 3.14
}
{
  "foo": 1,
  "bar": "foo"
}
{
  "foo": 2,
  "bar": [ "foo", "foobar" ],
  "foobar" : false,
  "a" : { "foo" : "foo", "b" : [ 3, 2 ] },
  "b" : 3.1415
}

Typically small documents

Typically large (up to thousands, millions, billions of objects)

Tree vs. flat

Homogeneous vs. Heterogeneous

Document Stores vs. RDBMS

Projection
Selection
Aggregation
Joins

NoSQL: validation after the data was populated

<row>
  <foo>1</foo>
  <bar>foo</bar>
  <foobar>true</foobar>
  <a>foo</a>
  <b>3.14</b>
</row>
<row>
  <foo>a</foo>
  <bar>bar</bar>
  <foobar>3</foobar>
  <a>null</a>
  <b>foo</b>
</row>

Implementations

Encoding

Character
0s and 1s

Common Character Encodings

ASCII
ISO Latin 1 (a.k.a. ISO-8859-1)
UTF-8
UTF-16

ASCII

ASCII Code Chart, scanner copied from the material delivered with TermiNet 300 impact type printer with Keyboard, February 1972, General Electric Data communication Product Dept., Waynesboro VA. http://archive.computerhistory.org/resources/text/GE/GE.TermiNet300.1971.102646207.pdf

UTF-8

Character:   Π
Code point:  03A0
Bits:        11 10100000
UTF-8:       11001110 10100000
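The bit pattern above can be checked with Python's built-in UTF-8 encoder. The helper name utf8_bits is illustrative; the code point U+03A0 is the one given on the slide.

```python
def utf8_bits(char):
    """Return the UTF-8 encoding of one character as space-separated bytes in binary."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))

char = "\u03A0"                # the code point 03A0 from the slide
code_point = f"{ord(char):04X}"
bits = utf8_bits(char)
print(code_point, bits)
```

The 11 payload bits of the code point are spread over two bytes: the first byte starts with 110 and carries five bits, the second starts with 10 and carries six.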

BSON

{ "foo" : null }

6 \x03 \x66 \x6F \x6F \x00 \x0A \x00
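To make the idea of a binary encoding concrete, here is a minimal sketch of a BSON encoder that handles only null and string values. It follows the layout described in the public BSON specification (a little-endian int32 total length, then elements made of a type byte and a null-terminated field name, then a closing zero byte); the byte sequence on the slide may be a simplified rendering and need not match this output exactly. The function names are hypothetical, and a real application would use an existing BSON library.

```python
import struct

def encode_cstring(text):
    """Encode a field name as UTF-8 bytes followed by a zero byte."""
    return text.encode("utf-8") + b"\x00"

def encode_document(doc):
    """Encode a flat dict whose values are None or str (nothing else is supported)."""
    body = b""
    for name, value in doc.items():
        if value is None:
            body += b"\x0A" + encode_cstring(name)          # type 0x0A: null
        elif isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            body += (b"\x02" + encode_cstring(name)         # type 0x02: string
                     + struct.pack("<i", len(data)) + data)
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    total = 4 + len(body) + 1   # length prefix + elements + terminator
    return struct.pack("<i", total) + body + b"\x00"

encoded = encode_document({"foo": None})
print(encoded.hex(" "))
```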

The MongoDB stack

Bits (BSON)
Well-formed JSON
Valid JSON
Queryable JSON

Querying a document store

Serg_v / 123RF Stock Photo

CRUD

Create
Read
Update
Delete

Read: selecting all documents

43

db.scientists.find({})

db.scientists.find()

SELECT * FROM scientists

SQL CheatSheet

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }

{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Relativity" }

{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

4444

Read

44

db.scientists.find(

{ "Theory" : "Relativity" }

)

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }

{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Incompleteness" }

{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

4545

Read

45

db.scientists.find(

{ "Theory" : "Relativity" }

)

SELECT *

FROM scientists

WHERE Theory = "Relativity"

SQL CheatSheet

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }

{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Incompleteness" }

{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

Read: projection

db.scientists.find(
  { "Theory" : "Relativity" },
  { "First" : 1, "Last" : 1 }
)

The first argument plays the role of WHERE, the second the role of SELECT.

SQL CheatSheet

SELECT First, Last
FROM scientists
WHERE Theory = 'Relativity'

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }
{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }
{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Incompleteness" }
{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

Result:

{ "First" : "Albert", "Last" : "Einstein" }
{ "First" : "Hermann", "Last" : "Minkowski" }
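What selection and projection do to a collection can be sketched in a few lines of plain Python. This is only an in-memory imitation under simplifying assumptions: the query supports equality on top-level fields only, and the _id field that a real document store manages is ignored. The function find here is a stand-in written for this example, not the MongoDB driver.

```python
def find(collection, query=None, projection=None):
    """Select documents matching all equality conditions, then keep the projected fields."""
    query = query or {}
    results = []
    for doc in collection:
        # Selection: every queried field must be present and equal.
        if all(field in doc and doc[field] == value for field, value in query.items()):
            if projection:
                # Projection: keep only the fields flagged with 1.
                doc = {key: val for key, val in doc.items() if projection.get(key) == 1}
            results.append(doc)
    return results

scientists = [
    {"First": "Albert", "Last": "Einstein", "Theory": "Relativity"},
    {"First": "Isaac", "Last": "Newton", "Theory": "Gravitation"},
    {"First": "Kurt", "Last": "Gödel", "Theory": "Incompleteness"},
    {"First": "Hermann", "Last": "Minkowski", "Theory": "Relativity"},
]
selected = find(scientists, {"Theory": "Relativity"}, {"First": 1, "Last": 1})
print(selected)
```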

Read: AND

db.scientists.find(
  {
    "Theory" : "Relativity",
    "Last" : "Einstein"
  }
)

SQL CheatSheet

SELECT *
FROM scientists
WHERE Theory = 'Relativity'
AND Last = 'Einstein'

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }
{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }
{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Relativity" }
{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

Read: OR

db.scientists.find(
  {
    "$or" : [
      { "Last" : "Newton" },
      { "Last" : "Einstein" }
    ]
  }
)

SQL CheatSheet

SELECT *
FROM scientists
WHERE Last = 'Newton'
OR Last = 'Einstein'

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }
{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }
{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Relativity" }
{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }

Read: Comparison

db.scientists.find(
  { "Publications" : { $gte : 100 } }
)

SQL CheatSheet

SELECT *
FROM scientists
WHERE Publications >= 100

{ "First" : "Albert", "Last" : "Einstein", "Publications": 500 }
{ "First" : "Isaac", "Last" : "Newton", "Publications": 30 }
{ "First" : "Kurt", "Last" : "Gödel", "Publications": 400 }
{ "First" : "Hermann", "Last" : "Minkowski", "Publications": 50 }
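The AND, OR and comparison queries above share one evaluation rule: a query object is an implicit conjunction, "$or" takes a list of alternative sub-queries, and an operator object such as { $gte : 100 } replaces a plain value. A minimal in-memory sketch of that rule follows; it supports only these three cases, and the sample collection merges fields from the slides above purely for illustration.

```python
MISSING = object()  # sentinel meaning "field not present in the document"

def matches(doc, query):
    """Return True if doc satisfies every condition in query (implicit AND)."""
    for field, condition in query.items():
        if field == "$or":
            # At least one alternative sub-query must match.
            if not any(matches(doc, alternative) for alternative in condition):
                return False
        elif isinstance(condition, dict) and "$gte" in condition:
            actual = doc.get(field, MISSING)
            # Non-numeric or missing values never satisfy the comparison here.
            if not isinstance(actual, (int, float)) or actual < condition["$gte"]:
                return False
        elif doc.get(field, MISSING) != condition:
            return False
    return True

def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]

scientists = [
    {"Last": "Einstein", "Theory": "Relativity", "Publications": 500},
    {"Last": "Newton", "Theory": "Gravitation", "Publications": 30},
    {"Last": "Gödel", "Theory": "Relativity", "Publications": 400},
    {"Last": "Minkowski", "Theory": "Relativity", "Publications": 50},
]
both = find(scientists, {"Theory": "Relativity", "Last": "Einstein"})
either = find(scientists, {"$or": [{"Last": "Newton"}, {"Last": "Einstein"}]})
prolific = find(scientists, {"Publications": {"$gte": 100}})
print([d["Last"] for d in both], [d["Last"] for d in either], [d["Last"] for d in prolific])
```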

Heterogeneity

db.scientists.find(
  { "Theory" : "Relativity" }
)

SQL CheatSheet

SELECT *
FROM scientists
WHERE Theory = 'Relativity'

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }
{ "First" : "Isaac", "Last" : "Newton", "Theory": false }                (other type)
{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Incompleteness" }
{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }
{ "First" : "Niels", "Last" : "Bohr" }                                   (missing field)

Heterogeneity

db.scientists.find(
  { "Theory" : null }
)

SQL CheatSheet

SELECT *
FROM scientists
WHERE Theory IS NULL

{ "First" : "Albert", "Last" : "Einstein", "Theory": "Relativity" }
{ "First" : "Isaac", "Last" : "Newton", "Theory": "Gravitation" }
{ "First" : "Kurt", "Last" : "Gödel", "Theory": "Incompleteness" }
{ "First" : "Hermann", "Last" : "Minkowski", "Theory": "Relativity" }
{ "First" : "Niels", "Last" : "Bohr" }
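The two heterogeneity cases can be imitated in memory as follows. The sketch assumes the behaviour described for MongoDB queries: a value of another type simply does not match (no error is raised), and a query for null matches documents where the field is null or absent. The function name find_equal is hypothetical.

```python
MISSING = object()  # sentinel meaning "field not present in the document"

def find_equal(collection, field, value):
    """Equality selection that tolerates other types and missing fields."""
    results = []
    for doc in collection:
        if value is None:
            # A null query matches an explicit null and an absent field alike.
            if doc.get(field) is None:
                results.append(doc)
        elif doc.get(field, MISSING) == value:
            results.append(doc)
    return results

scientists = [
    {"First": "Albert", "Last": "Einstein", "Theory": "Relativity"},
    {"First": "Isaac", "Last": "Newton", "Theory": False},           # other type
    {"First": "Kurt", "Last": "Gödel", "Theory": "Incompleteness"},
    {"First": "Hermann", "Last": "Minkowski", "Theory": "Relativity"},
    {"First": "Niels", "Last": "Bohr"},                               # missing field
]
relativists = find_equal(scientists, "Theory", "Relativity")
without_theory = find_equal(scientists, "Theory", None)
print([d["Last"] for d in relativists], [d["Last"] for d in without_theory])
```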

Read: nestedness (objects)

db.scientists.find({
  "Name.First" : "Albert"
})

{
  "Name" : { "First" : "Albert", "Last" : "Einstein" },
  "Theories": [ "Relativity" ]
}
{
  "Name" : { "First" : "Albert", "Last" : "Zweistein" },
  "Theories": [ "Unification" ]
}
{
  "Name" : { "First" : "Kurt", "Last" : "Gödel" },
  "Theories": [ "Incompleteness" ]
}

Read: possible confusion

db.scientists.find({
  "Name" : { "First" : "Albert" }
})

{
  "Name" : { "First" : "Albert", "Last" : "Einstein" },
  "Theories": [ "Relativity" ]
}
{
  "Name" : { "First" : "Albert" },
  "Theories": [ "Unification" ]
}
{
  "Name" : { "First" : "Kurt", "Last" : "Gödel" },
  "Theories": [ "Incompleteness" ]
}

Read: nestedness (arrays)

db.scientists.find({
  "Theories" : "Special relativity"
})

{
  "Name" : { "First" : "Albert", "Last" : "Einstein" },
  "Theories": [ "Special relativity", "General relativity" ]
}
{
  "Name" : { "First" : "Kurt", "Last" : "Gödel" },
  "Theories": [ "Incompleteness" ]
}
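The three nested cases (dot path, whole sub-object, array membership) can be told apart with a small in-memory sketch. It is simplified on purpose: dot paths only descend through objects, not through arrays, and Python compares dictionaries without regard to field order, whereas an exact sub-object match in a document store is usually order-sensitive. The names resolve and matches are illustrative.

```python
MISSING = object()  # sentinel meaning "path does not exist in the document"

def resolve(doc, path):
    """Follow a dot-separated path such as "Name.First" through nested objects."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return MISSING
        current = current[key]
    return current

def matches(doc, path, expected):
    actual = resolve(doc, path)
    if isinstance(actual, list) and not isinstance(expected, list):
        return expected in actual     # array field: match if any element is equal
    return actual == expected         # scalar or whole sub-object: exact equality

scientists = [
    {"Name": {"First": "Albert", "Last": "Einstein"},
     "Theories": ["Special relativity", "General relativity"]},
    {"Name": {"First": "Albert"}, "Theories": ["Unification"]},
    {"Name": {"First": "Kurt", "Last": "Gödel"}, "Theories": ["Incompleteness"]},
]
by_path = [d for d in scientists if matches(d, "Name.First", "Albert")]
exact = [d for d in scientists if matches(d, "Name", {"First": "Albert"})]
in_array = [d for d in scientists if matches(d, "Theories", "Special relativity")]
print(len(by_path), len(exact), len(in_array))
```

The dot path finds both Alberts, while the sub-object query finds only the document whose Name is exactly { "First" : "Albert" }, which is the possible confusion named above.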

Read: other operators

db.scientists.find({
  "University" : {
    $in : [ "ETH Zurich", "EPFL" ]
  }
})

db.scientists.find({
  "University" : {
    $nin : [ "ETH Zurich", "EPFL" ]
  }
})

{
  "Name" : { "First" : "Albert", "Last" : "Einstein" },
  "University" : "ETH Zurich"
}
{
  "Name" : { "First" : "Kurt", "Last" : "Gödel" },
  "University" : "Uni Wien"
}

Count

db.scientists.find({
  "University" : { $in : [ "ETH Zurich", "EPFL" ] }
}).count()

Sort

db.scientists.find({
  "University" : { $in : [ "ETH Zurich", "EPFL" ] }
}).sort({ "Founded" : -1 })

Limit and offset

db.scientists.find({
  "University" : { $in : [ "ETH Zurich", "EPFL" ] }
}).sort({ "Founded" : -1 }).skip(30).limit(10)

Duplicates

db.scientists.distinct("name")
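The operators and cursor methods above map onto ordinary list operations. The sketch below imitates $in, $nin, count, sort, skip/limit and distinct on an in-memory list; the collection is synthetic (60 made-up documents with invented "Founded" values) and all function names are assumptions made for this example.

```python
def find_in(collection, field, allowed):
    """$in: the field value must be one of the allowed values."""
    return [doc for doc in collection if doc.get(field) in allowed]

def find_nin(collection, field, excluded):
    """$nin: the field value must not be excluded (a missing field also matches)."""
    return [doc for doc in collection if doc.get(field) not in excluded]

def sort_skip_limit(docs, field, descending, skip, limit):
    ordered = sorted(docs, key=lambda doc: doc[field], reverse=descending)
    return ordered[skip:skip + limit]

def distinct(collection, field):
    """Distinct values of a field, in order of first appearance."""
    seen = []
    for doc in collection:
        if field in doc and doc[field] not in seen:
            seen.append(doc[field])
    return seen

# Synthetic data: values are invented for the example.
universities = ["ETH Zurich", "EPFL", "Uni Wien"]
collection = [
    {"Name": f"scientist-{i}", "University": universities[i % 3], "Founded": 1800 + i}
    for i in range(60)
]
swiss = find_in(collection, "University", ["ETH Zurich", "EPFL"])
others = find_nin(collection, "University", ["ETH Zurich", "EPFL"])
count = len(swiss)                                         # .count()
page = sort_skip_limit(swiss, "Founded", True, 30, 10)     # .sort().skip(30).limit(10)
print(count, [doc["Founded"] for doc in page], distinct(collection, "University"))
```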

Aggregation and pipelines

db.scientists.aggregate([
  { $match : { "Century" : 20 } },
  { $group : { "_id" : "$year", "Count" : { "$sum" : 1 } } },
  { $sort : { "Count" : -1 } },
  { $limit : 5 }
])

Pipeline

Like MapReduce and Spark!

STAGES
CREATION
TRANSFORMATION
ACTION

But we'll see a much easier way next week.
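Each stage of such a pipeline consumes the output of the previous one. A minimal sketch of the four stages in plain Python is shown below; it hard-codes the counting case of $group, and the sample documents (names and years) are invented for illustration.

```python
from collections import Counter

def match(docs, field, value):
    """$match: keep the documents whose field equals value."""
    return [doc for doc in docs if doc.get(field) == value]

def group_count(docs, field):
    """$group with a $sum of 1: count documents per distinct value of field."""
    counts = Counter(doc.get(field) for doc in docs)
    return [{"_id": key, "Count": count} for key, count in counts.items()]

def sort_desc(docs, field):
    """$sort with -1: descending order on field."""
    return sorted(docs, key=lambda doc: doc[field], reverse=True)

def limit(docs, n):
    """$limit: keep the first n documents."""
    return docs[:n]

# Invented sample data.
scientists = [
    {"Name": "A", "Century": 20, "year": 1905},
    {"Name": "B", "Century": 20, "year": 1905},
    {"Name": "C", "Century": 20, "year": 1931},
    {"Name": "D", "Century": 19, "year": 1887},
]
result = limit(sort_desc(group_count(match(scientists, "Century", 20), "year"), "Count"), 5)
print(result)
```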

Insert

db.scientists.insertOne(
  { "Name" : "Einstein", "Theory" : "Relativity" }
)

Update

db.scientists.updateMany(
  { "Name" : "Einstein" },
  { $set : { "Century" : 20 } }
)

Remove

db.scientists.deleteMany(
  { "Century" : 15 }
)

Writing: atomicity

Granularity of atomicity: one document

Query document stores on a higher level?

{JSONiq}
<XQuery/>
UNQL

Next week!

Architecture

Serg_v / 123RF Stock Photo

Principle 8. Shard the data

Principle 9. Replicate the data

Principle 10. Buy lots of cheap hardware

Replication in document stores

Master

Clustering

A-E  F-P  Q-Z
A-E  F-P  Q-Z
A-E  F-P  Q-Z

Replica Set
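Range-based sharding as in the A-E / F-P / Q-Z picture can be sketched as a lookup of the shard whose upper bound is the first one not below the key. The code assumes keys that start with a letter from A to Z; the routing function shard_for and the choice of the first letter as shard key are illustrative assumptions, not how any particular document store picks its shard key.

```python
import bisect

UPPER_BOUNDS = ["E", "P", "Z"]      # last letter covered by each shard
SHARDS = ["A-E", "F-P", "Q-Z"]

def shard_for(name):
    """Return the label of the shard responsible for a name."""
    first = name[0].upper()
    if not "A" <= first <= "Z":
        raise ValueError(f"no shard covers keys starting with {first!r}")
    # Index of the first upper bound that is >= the key's first letter.
    return SHARDS[bisect.bisect_left(UPPER_BOUNDS, first)]

routing = {name: shard_for(name) for name in ["Einstein", "Gödel", "Pythagoras", "Turing"]}
print(routing)
```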

Replica sets on physical level

Primary
Secondary  Secondary
Replica set

Shard 1, Shard 2, Shard 3: one replica set (one primary, two secondaries) per shard

Write concerns

[Figure sequence: write]

Indices

Ilya Akinshin / 123RF Stock Photo

{ "Name": "Apple", "Color": [ "green", "red" ] }
{ "Name": "Orange", "Color": [ "orange" ] }
{ "Name": "Banana", "Color": [ "yellow" ] }
{ "Name": "Kiwi", "Color": [ "brown", "green" ] }
{ "Name": "Ananas", "Color": [ "yellow" ] }

Big collections

{"Name":"Einstein", "Profession":"Physicist"}
{"Name":"Gödel", "Profession":"Mathematician"}
{"Name":"Ramanujan", "Profession":"Mathematician"}
{"Name":"Pythagoras", "Profession":"Mathematician"}
{"Name":"Turing", "Profession":"Computer Scientist"}
{"Name":"Church", "Profession":"Computer Scientist"}
{"Name":"Nash", "Profession":"Economist"}
{"Name":"Euler", "Profession":"Mathematician"}
{"Name":"Bohm", "Profession":"Physicist"}
{"Name":"Galileo", "Profession":"Astrophysicist"}
{"Name":"Lagrange", "Profession":"Mathematician"}
{"Name":"Gauss", "Profession":"Mathematician"}
{"Name":"Thales", "Profession":"Mathematician"}
...

Billions of objects

Point queries

find({"Name":"Euler"})

Not highly-filtering queries

find({"Profession":"Mathematician"})

Range queries

{"Name":"Einstein", "Year":1879}
{"Name":"Gödel", "Year":1906}
{"Name":"Ramanujan", "Year":1887}
{"Name":"Pythagoras", "Year":-570}
{"Name":"Turing", "Year":1912}
{"Name":"Church", "Year":1903}
{"Name":"Nash", "Year":1928}
{"Name":"Euler", "Year":1707}
{"Name":"Bohm", "Year":1917}
{"Name":"Galileo", "Year":1564}
{"Name":"Lagrange", "Year":1736}
{"Name":"Gauss", "Year":1777}
{"Name":"Thales", "Year":-624}
...

find({"Year":{"$gte":1900}})

But... How can we make this super-fast?

Indices

{ "Name": "Apple", "Color": [ "green", "red" ] }
{ "Name": "Orange", "Color": [ "orange" ] }
{ "Name": "Banana", "Color": [ "yellow" ] }
{ "Name": "Kiwi", "Color": [ "brown", "green" ] }
{ "Name": "Ananas", "Color": [ "yellow" ] }

Color   Name
yellow  Banana, Ananas
orange  Orange
red     Apple
green   Apple, Kiwi
brown   Kiwi

Hash indices (the fastest)

Scientists

{"Name":{"F":"Albert","L":"Einstein"},"Country":"Switzerland","Century":20}
{"Name":"Gödel","Country":"Austria","Century":20}
{"Name":"Ramanujan","Country":"India","Century":19}
{"Name":"Euclid","Country":"Greece","Century":-4}
{"Name":"Pythagoras","Country":"Greece","Century":-6}
{"Name":"Turing","Country":"UK","Century":20}

db.scientists.createIndex({
  "Century" : "hashed"
})

[Figure sequence: each Century value is hashed to a slot of the index, and the slot points to the matching records.]

Hash     Value
h(20)=0  20
h(-6)=1  -6
h(-4)=3  -4
h(19)=4  19

db.scientists.find({"Century":19})

h(19)=4

Limitations of hash indices

No support for range queries
Hash function not perfect in real life
Space requirements for collision avoidance
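A hash index can be imitated with a Python dictionary, which is itself a hash table: each indexed value maps to the positions of the records that carry it. The sketch uses Python's own hash function rather than the toy function h of the slides, and the helper names are made up for this example.

```python
from collections import defaultdict

scientists = [
    {"Name": {"F": "Albert", "L": "Einstein"}, "Country": "Switzerland", "Century": 20},
    {"Name": "Gödel", "Country": "Austria", "Century": 20},
    {"Name": "Ramanujan", "Country": "India", "Century": 19},
    {"Name": "Euclid", "Country": "Greece", "Century": -4},
    {"Name": "Pythagoras", "Country": "Greece", "Century": -6},
    {"Name": "Turing", "Country": "UK", "Century": 20},
]

def build_hash_index(docs, field):
    """Map each value of field to the list of positions of matching documents."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        if field in doc:
            index[doc[field]].append(position)
    return index

def point_query(docs, index, value):
    """Look up one value: a single hash probe instead of a scan over all documents."""
    return [docs[position] for position in index.get(value, [])]

century_index = build_hash_index(scientists, "Century")
nineteenth = point_query(scientists, century_index, 19)
twentieth = point_query(scientists, century_index, 20)
print([doc["Country"] for doc in nineteenth], [doc["Country"] for doc in twentieth])
```

A query such as "Century at least 19" gets no help from this structure, because hashing scatters neighbouring values; that is the first limitation listed above.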

B+-tree example

Root:    possess
Inner:   come, is, merely  |  that, thy, upon
Leaves:  almost be carefully | come fair hour | is it Laertes | merely most my | possess should take | that thine this | thy time to | upon you your

Disks block access

All leaves at same depth

All non-leaf nodes have between 3 and 5 children (here the two inner nodes have 4 children each and the root has 2)

General case: #children between d+1 and 2d+1

But it's fine if the root has fewer.

Actual values only at the leaves

Warning: intervals!

n trees
n-1 intervals

Keys 1 2: 2 keys, 3 children
Keys 1 2 3 4: 4 keys, 5 children

Insertion

1

118118

Insertion

1 2

119119

Insertion

1 2 3

120120

Insertion

1 2 3 4

121121

Insertion

1 2 3 4 5

More than 4 values!

122122

Insertion

1 2 4 5

4

3 2

3

123123

Insertion

1 2 4 5

4

3 6

124124

Insertion

1 2 4 5

4

3 6 7

125125

Insertion

1 2 4 5

4

3 6 7 8

126126

Insertion

1 2 4 5

4

3 6 7 8

127127

Insertion

1 2 4 5

4

7

7

83 6

128128

Insertion

1 2 4 5

4

7

7

83 6 9

129129

Insertion

1 2

4 7

3 4 5 7 86 9 10

130130

Insertion

1 2

4 7

3 4 5 7 86 9 10 11

131131

Insertion

1 2

4 7

3 4 5 7 86 9 10

10

11

132132

Insertion

1 2

4 7

3

4 57 8

69

10

10

11

13

12

13 14 15 16

133133

Insertion

1 2

4 7

3

4 57 8

69

10

10

11

13

12

13 14 15 16 17

134134

Insertion

1 2

4 7

3

4 57 8

69

10

10

11

13

12

13 14 15 16 17

16

135135

Insertion

1 2

4 7

3

4 57 8

69

10

13

11 12

13 14 15 16 17

16

10
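The insertion walk-through can be replayed with a small B+-tree sketch. It assumes at most 4 values per node, a leaf split that keeps three values on the left and copies the first right-hand value up, and an inner-node split that moves the middle key up. Deletion, duplicate handling and links between leaves are left out, and all names are illustrative.

```python
import bisect

MAX_KEYS = 4  # a node overflows when it holds more than 4 values

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []   # used by non-leaf nodes only

def _insert(node, key):
    """Insert key below node; return (separator, new_right_node) if node was split."""
    if node.leaf:
        bisect.insort(node.keys, key)
        if len(node.keys) <= MAX_KEYS:
            return None
        mid = (len(node.keys) + 1) // 2           # 5 values: 3 stay left, 2 go right
        right = Node(leaf=True)
        right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        return right.keys[0], right               # separator is copied up
    i = bisect.bisect_right(node.keys, key)       # child whose interval contains key
    split = _insert(node.children[i], key)
    if split is None:
        return None
    separator, new_child = split
    node.keys.insert(i, separator)
    node.children.insert(i + 1, new_child)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                     # 5 keys: the middle one moves up
    right = Node(leaf=False)
    up = node.keys[mid]
    right.keys, right.children = node.keys[mid + 1:], node.children[mid + 1:]
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    new_root = Node(leaf=False)
    new_root.keys = [split[0]]
    new_root.children = [root, split[1]]
    return new_root

def leaves(node):
    """Collect the leaf contents from left to right."""
    if node.leaf:
        return [node.keys]
    return [keys for child in node.children for keys in leaves(child)]

root = Node(leaf=True)
for value in range(1, 18):
    root = insert(root, value)
print(root.keys, [child.keys for child in root.children], leaves(root))
```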

Tree indices (logarithmic)

Scientists

{"Name":{"F":"Albert","L":"Einstein"},"Country":"Switzerland","Century":20}
{"Name":"Gödel","Country":"Austria","Century":20}
{"Name":"Ramanujan","Country":"India","Century":19}
{"Name":"Euclid","Country":"Greece","Century":-4}
{"Name":"Pythagoras","Country":"Greece","Century":-6}
{"Name":"Turing","Country":"UK","Century":20}

db.scientists.createIndex({
  "Century" : 1
})

2-3 B+-tree

[Figure sequence: the Century values 20, 19, -4 and -6 are inserted one by one; the final tree has the leaf values -6, -4, 19, 20 below the root key 19.]

db.scientists.find({"Century":{"$gte":19}})
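The benefit of an ordered index for a range query can be sketched with a sorted list and binary search, which gives the same logarithmic lookup as a tree without the node structure. This is a stand-in for a B+-tree, not an implementation of one; the function names are assumptions.

```python
import bisect

scientists = [
    {"Name": {"F": "Albert", "L": "Einstein"}, "Country": "Switzerland", "Century": 20},
    {"Name": "Gödel", "Country": "Austria", "Century": 20},
    {"Name": "Ramanujan", "Country": "India", "Century": 19},
    {"Name": "Euclid", "Country": "Greece", "Century": -4},
    {"Name": "Pythagoras", "Country": "Greece", "Century": -6},
    {"Name": "Turing", "Country": "UK", "Century": 20},
]

def build_sorted_index(docs, field):
    """Return parallel lists: the sorted field values and the document positions."""
    entries = sorted((doc[field], position) for position, doc in enumerate(docs))
    return [value for value, _ in entries], [position for _, position in entries]

def find_gte(docs, keys, positions, lower_bound):
    """Range query: binary search for the first key >= lower_bound, then read onwards."""
    start = bisect.bisect_left(keys, lower_bound)
    return [docs[position] for position in positions[start:]]

keys, positions = build_sorted_index(scientists, "Century")
recent = find_gte(scientists, keys, positions, 19)
print(keys, [doc["Country"] for doc in recent])
```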

Indices

Primary: _id
Secondary: other fields

Query with no indices

Scan/filter in memory

Query with indices

Prefilter with index
More scan/filter in memory

Index creation: hash

Index:

db.scientists.createIndex({
  "Name.Last" : "hashed"
})

Query:

db.scientists.find({
  "Name.Last" : "Einstein"
})

Index creation: hash

Index:

db.scientists.createIndex({
  "Profession" : "hashed"
})

Query:

db.scientists.find({
  "Profession" : "Physicist",
  "Theories" : "Relativity"
})

Post-filtering

Index creation: compound (only B+-tree!)

Index:

db.scientists.createIndex({
  "Birth" : 1,
  "Death" : 1
})

Query:

db.scientists.find({
  "Birth" : 1887,
  "Death" : 1946
})

Index creation: compound (only B+-tree!)

Index (Death descending):

db.scientists.createIndex({
  "Birth" : 1,
  "Death" : -1
})

Query:

db.scientists.find({
  "Birth" : 1887,
  "Death" : 1946
})

Index creation: range

Index:

db.scientists.createIndex({
  "Birth" : 1
})

Query:

db.scientists.find({
  "Birth" : { "$gte": 1946 }
})

Query with post-filtering:

db.scientists.find({
  "Birth" : { "$gte": 1946 },
  "Death" : 1998
})

Index creation: compound (only B+-tree!)

Index:

db.scientists.createIndex({
  "Birth" : 1,
  "Death" : -1
})

Queries:

db.scientists.find({
  "Birth" : 1887
})

db.scientists.find({
  "Birth" : { "$gte" : 1980 }
})

db.scientists.find({
  "Death" : 1887
})

Post-filtering (why?)

Index creation: prefixes are implied

The index

{
  "Birth date" : 1,
  "Death date" : -1,
  "Name.Last" : 1
}

implies

{
  "Birth date" : 1,
  "Death date" : -1
}

{
  "Birth date" : 1
}
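Why a prefix of a compound index comes for free, while a query on a later field alone does not, can be seen from the sort order. The sketch below keeps the entries of a (Birth ascending, Death descending) index in a sorted list: entries with the same Birth are adjacent, so a binary search finds them, whereas entries with the same Death are scattered and require a scan. The field names Birth and Death follow the earlier slides; the sample documents and function names are chosen for illustration.

```python
import bisect

# Illustrative sample data.
scientists = [
    {"Name": "Einstein", "Birth": 1879, "Death": 1955},
    {"Name": "Ramanujan", "Birth": 1887, "Death": 1920},
    {"Name": "Schrödinger", "Birth": 1887, "Death": 1961},
    {"Name": "Gödel", "Birth": 1906, "Death": 1978},
    {"Name": "Turing", "Birth": 1912, "Death": 1954},
]

def build_compound_index(docs):
    """Entries sorted by Birth ascending, then Death descending (hence the negation)."""
    return sorted((doc["Birth"], -doc["Death"], position) for position, doc in enumerate(docs))

def find_by_birth(docs, index, birth):
    """Prefix query: all entries with this Birth form one contiguous run."""
    low = bisect.bisect_left(index, (birth,))
    high = bisect.bisect_left(index, (birth + 1,))
    return [docs[position] for _, _, position in index[low:high]]

def find_by_death(docs, death):
    """Death is not a prefix of the index, so every document has to be checked."""
    return [doc for doc in docs if doc["Death"] == death]

index = build_compound_index(scientists)
born_1887 = find_by_birth(scientists, index, 1887)
died_1954 = find_by_death(scientists, 1954)
print([doc["Name"] for doc in born_1887], [doc["Name"] for doc in died_1954])
```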