Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Introduction to NoSQL databases
Applications and Web platformsExponential growth of the amount of Data
x2 / 2 yearsUnprecedent management of this volume• Need distribution:
• Computation• Data
• Clusters: Huge number of serversEx: Google, Amazon, FacebookGoogle DataCenter :
• 5000 servers/data center, ~1M de serversFacebook :
• 1 PetaBytes of data
Big Data Context
Standard Data Warehousing Methodology
• VolumeDesigned to store GB/TBBut needs PB (maybe EB)
• VarietyHeterogeneous dataVariable types Text, semi-structured
• VelocityFast data production
DBMS vs the Problem of 3V
DBMS vs the Problem of 3V
• Pros DBMS▫ Joins between tables▫ SQL: Rich query language▫ Constraints: integrity, data types, normal forms
• Cons for Distribution▫ How to partition/place data?▫ How make joins?▫ How to manage fault tolerance?
DBMS vs Distribution
Distributed systems: BASE• Basically Available :
• Any request => An answer• Even in a changing state
• Soft-state : • Opposite to Durability.• System’s state (servers or data) could
change over time (without any update)• Eventually consistent :
• With time, data can be consistent• Updates have to be propagated
ACID vs BASE
0 141 2 3 4 5 6 7 8 9 10 11 12 13
ACID BASEAtomicityConsistencyIsolationDurability
Basically AvailableSoft-StateEventually Consistent
NoSQLSGBDR VS
Local DBMS: ACID transactions• Atomicity: integral completion or none• Consistency: consistent at start and end• Isolation: no communication between them• Durability: an operation cannot be reversed
NoSQL : Not Only SQL• New data storage/management approach• Scales up the system (through distribution)• Complex metadata management (schemaless)
Do not substitute DBMS!!NoSQL’s target:• Very huge volume of data (PetaBytes)• Very short response time• Consistency is not mandatory
What is NoSQL?
• No more relations• No schema (hardly data types)• Dynamic schemaØ Collections
• No more tuples• Nesting• ArraysØ JSON documents
• Distribution• Parallelism (abstraction of Map/Reduce, not cloud computing)
• Data replication• Availability vs Consistency• No more transactionsØ Few writes, many reads
NoSQL characteristics
• Files are segmented in chunks (64MB, 256MB…)• Chunks are distributed in the cluster• Data are placed according to a sharding strategy
• 3 types of technics of sharding:1. HDFS: Resource allocation based (racks, switches, datacenter)
• ++ Fault tolerance and massive computations2. Clustered index: Tree-based structure (sort)
• ++ Physical data clustering and dynamicity3. DHT: Hash-based structure
• ++ elasticity and self-management
Sharding: Data Distribution Strategies
Resources allocation strategy• Distributed file system• Rely on servers load balancing• Dedicated to fault tolerance• Dynamic allocation and optimization
Sharding with HDFS
(1) See Hadoop Distributed File System http://chewbii.com/transparents-hdfs-hadoop/
Sharding with HDFSExample
NameNode
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
NameNode
DataNode
DataNodeRack Rack
DataNode
DataNode
DataNode
DataNode
DataNode
DataNodeRack
DataNode
NameNode
DataNode
DataNode
switch
switch
switch
Rack Rack
DataNode
DataNode
DataNode
DataNodeChunk 3
DataNode
DataNodeRack
DataNode
NameNode
Chunk 1
DataNodeChunk 4
DataNode
Chunk 2
Chunk 5
Chunk 6
switch
switch
switch
Rack Rack
DataNode
DataNode
DataNode
DataNodeChunk 3
DataNode
DataNodeRack
DataNode
NameNode
Chunk 1 Chunk 2
DataNodeChunk 4
Chunk 5DataNode
Chunk 2
Chunk 1 Chunk 3Chunk 4
Chunk 6Chunk 5
Chunk 6
switch
switch
switch
Rack Rack
DataNode
DataNode
DataNode
DataNodeChunk 3
DataNode
DataNodeRack
DataNode
NameNode
Chunk 1 Chunk 2
DataNode
Chunk 5
Chunk 4Chunk 1
Chunk 2
Chunk 3
Chunk 5DataNode
Chunk 2
Chunk 1 Chunk 3
Chunk 4
Chunk 4
Chunk 6Chunk 5
Chunk 6
Chunk 6
switch
switch
switch
Rack Rack
DataNode
DataNode
DataNode
DataNodeChunk 3
DataNode
DataNodeRack
DataNode
NameNode
Chunk 1
Secondary NameNode
Chunk 2
DataNode
Chunk 5
Chunk 4Chunk 1
Chunk 2
Chunk 3
Chunk 5DataNode
Chunk 2
Chunk 1 Chunk 3
Chunk 4
Chunk 4
Chunk 6Chunk 5
Chunk 6
Chunk 6
switch
switch
switch
Distributed computation Fault Tolerance
Sharding with HDFSNoSQL solutions
Distributed clustered index• Sorted data according to a key
• Primary key• Build chunks (default 256Mo)• Distribute chunks• Replicate chunks
�Range queries and group by queries⚠Only one key can be chosen
Sharding with Clustered index
(2) See « index non-dense » http://chewbii.com/videos-optimisation-bases-de-donnees/ (Vidéo 4)
Sharding with Clustered IndexExample
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud Noeud Noeud NoeudNoeud + réplicas
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud + réplicas Noeud + réplicas Noeud + réplicas Noeud + répl.Noeud + réplicas
Feuille-inf -> 30 000
Feuille60 001 -> +inf
Feuille30 001 -> 60 000
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud + réplicas Noeud + réplicas Noeud + réplicas Noeud + répl.Noeud + réplicas
Racine
Feuille-inf -> 30 000
Feuille60 001 -> +inf
Feuille30 001 -> 60 000
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud + réplicas Noeud + réplicas Noeud + réplicas Noeud + répl.Noeud + réplicas
RouteurRacine
Feuille-inf -> 30 000
Feuille60 001 -> +inf
Feuille30 001 -> 60 000
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud + réplicas Noeud + réplicas Noeud + réplicas Noeud + répl.Noeud + réplicas
Routeur x3Racine
Feuille-inf -> 30 000
Feuille60 001 -> +inf
Feuille30 001 -> 60 000
Chunk-inf -> 10 000
Chunk10 001 -> 20 000
Chunk20 001 -> 30 000
Chunk30 001 -> 40 000
Chunk40 001 -> 50 000
Chunk50 001 -> 60 000
Chunk60 001 -> 70 000
Chunk70 001 -> 80 000
Chunk80 001 —> inf
Noeud + réplicas Noeud + réplicas Noeud + réplicas Noeud + répl.
Clustering Complex queries
Sharding with Clustered IndexSolutions
Distributed Hash Table (DHT3)• Ring of virtual servers
• Max 264
• Unique hash table: splitted & distributed• Routing• Efficiency• Self-Management (no main server)
Sharding with DHT
(3) See DHT http://chewbii.com/hachagedynamique/ (Vidéo 4 & 5)
Sharding with DHTExample
0 = 264 = 4 x 262
262
263 = 2 x 262
3 x 262
0 = 264 = 4 x 262
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
0 = 264 = 4 x 262
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
d1
d2
d3
d4
d5
d6
d7
0 = 264 = 4 x 262
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
d1
d2
d3
d4
d5
d6 Chunk S1d1
d5 d2 d4réplicas
resp
Chunk S2d5 d2
d4réplicas
resp
d3
Chunk S3d4
réplicas
resp
d3 d6
d7
d7
Chunk S4d3
réplicas
resp
d6 d1
Chunk S5d6
réplicas
resp
d1 d5 d2
d7
d7
0 = 264 = 4 x 262
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
d1
d2
d3
d4
d5
d6 Chunk S1d1
d5 d2 d4réplicas
resp
Chunk S2d5 d2
d4réplicas
resp
d3
Chunk S3d4
réplicas
resp
d3 d6
d7
d7
Chunk S4d3
réplicas
resp
d6 d1
Chunk S5d6
réplicas
resp
d1 d5 d2
d7
d7
0 = 264 = 4 x 262d3 ?
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
d1
d2
d3
d4
d5
d6 Chunk S1d1
d5 d2 d4réplicas
resp
Chunk S2d5 d2
d4réplicas
resp
d3
Chunk S3d4
réplicas
resp
d3 d6
d7
d7
Chunk S4d3
réplicas
resp
d6 d1
Chunk S5d6
réplicas
resp
d1 d5 d2
d7
d7
0 = 264 = 4 x 262d3 ?
Portion du
serveur S1
Portion duserveur S2
Portion du serveur S3
Port
ion
duse
rveu
r S4
Portion du
serveur S5
S3
S4
S5
S1
S2
262
263 = 2 x 262
3 x 262
d1
d2
d3
d4
d5
d6 Chunk S1d1
d5 d2 d4réplicas
resp
Chunk S2d5 d2
d4réplicas
resp
d3
Chunk S3d4
réplicas
resp
d3 d6
d7
d7
Chunk S4d3
réplicas
resp
d6 d1
Chunk S5d6
réplicas
resp
d1 d5 d2
d7
d7
0 = 264 = 4 x 262d3 ?
Elasticity Self-Management
Sharding with DHTSolutions
NoSQL Data Type Families
Nicolastype:prof
RégisLuc
type:resp formation
Célinetype:prof
CNAMadresse:…
OCadresse:…
CentraleSupélec
employeur
employeur
employeur
vacation
Paris Gif-sur-Yvette
siège social siège social siège social
vacation employeur
vacation
vacation
Key-Value stores
Document-orientedGraph oriented
Column-oriented
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "type": "prof", "lieu": { "nom": "ESILV", "adresse":"12 avenue Léonard de Vinci", "ville": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "intérêts": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "lieu": { "nom": "ESILV", "adresse":"12 avenue Léonard de Vinci", "ville": "Paris La Défense" }, "spec": ["Signal Processing", "Tourism"], "intérêts": ["Music"]}
{ "_id": "Philippe", "type": "prof", "lieu": { "nom": "Cnam", "adresse":"2 rue Conté", "ville": "Paris" }, "spec": ["BDD", "NoSQL""Music"]}
{ "_id": "Céline", "type": "prof", "lieu": { "nom": "CentraleSupelec", "adresse":"rue Joliot-Cury", "ville": "Gif-sur-Yvette » }, "spec": ["Ontologie", "Logique formelle", "Visualisation"]}
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
Similar to a distributed “HashMap”
Key + Value:• No fixed schema on values (strings, object, integer, binaries…)
Drawbacks:• No structures nor typing• No structured-based queries
Key-Value Store
Key-Value StoreExample
22
Relational database
status employer spec
prof ESILV BDD, NoSQLresp
formation ESILV Signal Processing, Visualization
prof CNAM BDD, NoSQL, Music
prof CentraleSupelec Ontology, Formal Logic, Visualization
id
Nicolas
Gaël
Philippe
Céline
hobbies
BZH, Star Wars
Tourism, Music
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
CRUD• CREATE ( key, value)
• CREATE ("Nicolas", "type:'prof',employer:'ESILV',spec:'BDD,NoSQL',hobbies:'BZH,Star Wars' ") à OK• READ( key)
• READ("Nicolas") à "type:'prof',employer:'ESILV',spec:'BDD,NoSQL',hobbies:'BZH,Star Wars' "• UPDATE( key, value)
• UPDATE("Nicolas", "type:'prof',employer:ESILV',spec:'BDD,NoSQL' ") à OK• DELETE( key)
• DELETE("Nicolas") à OK
Key-Value StoreQueries
23
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQLhobbies: BZH, Star Wars
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
Nicolas Gaël Philippe Céline
type: profemployer: ESILV
spec: BDD,NoSQL
type: resp formationemployer: ESILV
spec: Signal Processing,Visualization
hobbies: Tourism, Music
type:profemployer:CNAM
spec: BDD, NoSQL, Music
type: profemployer: CentraleSupelec
spec: Ontology, Formal Logic, Visualization
Keys
values
• Web site logs store• Web site/DB cache• User profil management• Sensors status• E-commerce bucket storage
Key-Value StoreUse cases
Efficiency Easy to set up
Key-value StoreSolutions
25
Column-based storage• DBMS: tuples (lines/rows) are stored in a whole• Column-oriented: attributes are separated in column-families and stored independently• Efficient queries on columns (aggregates)
Easy to insert a new column• Dynamic schema
Column-Oriented Store
status employer spec
prof ESILV BDD, NoSQLresp
formation ESILV Signal Processing, Visualization
prof CNAM BDD, NoSQL, Music
prof CentraleSupelec Ontology, Formal Logic, Visualization
id
Nicolas
Gaël
Philippe
Céline
hobbies
BZH, Star Wars
Tourism, Music
Column-Oriented StoresExample
27
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
Column-oriented queries• How many professors (status) are at
ESILV (employer)
Column-Oriented StoresQueries
28
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
status
profresp
formationprof
prof
id
Nicolas
Gaël
Philippe
Céline
employer
ESILV
ESILV
CNAM
CentraleSupelec
id
Nicolas
Gaël
Philippe
Céline
spec
BDD
Signal Processing
BDD
Ontology
id
Nicolas
Gaël
Philippe
Céline
NoSQLNicolas
VisualizationGaël
NoSQLPhilippe
MusicPhilippe
Formal LogicCéline
VisualizationCéline
id
Nicolas
Gaël
hobbies
BZH
Tourism
Nicolas Star Wars
Gaël Music
• Statistics on online polls• Logging• Category searchs on products (ie. Ebay)• Reporting
Column-Oriented StoresUse cases
Aggregates Correlations
Column-Oriented StoresSolutions
30
Based on the key-value store• Add semi-structured data (JSON/XML)• Keys indexing• Lists of values, nesting• Allow fusion of concepts
HTTP API• More complex than CRUD• Rich queries (database)
Document-Oriented Stores
Document-Oriented StoresExample
32
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "type": "prof", "lieu": { "nom": "ESILV", "adresse":"12 avenue Léonard de Vinci", "ville": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "intérêts": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "lieu": { "nom": "ESILV", "adresse":"12 avenue Léonard de Vinci", "ville": "Paris La Défense" }, "spec": ["Signal Processing", "Tourism"], "intérêts": ["Music"]}
{ "_id": "Philippe", "type": "prof", "lieu": { "nom": "Cnam", "adresse":"2 rue Conté", "ville": "Paris" }, "spec": ["BDD", "NoSQL""Music"]}
{ "_id": "Céline", "type": "prof", "lieu": { "nom": "CentraleSupelec", "adresse":"rue Joliot-Cury", "ville": "Gif-sur-Yvette » }, "spec": ["Ontologie", "Logique formelle", "Visualisation"]}
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
status employer spec
prof ESILV BDD, NoSQLresp
formation ESILV Signal Processing, Visualization
prof CNAM BDD, NoSQL, Music
prof CentraleSupelec Ontology, Formal Logic, Visualization
id
Nicolas
Gaël
Philippe
Céline
hobbies
BZH, Star Wars
Tourism, Music
Nicolas Gaël Philippe Céline
{ "_id": "Nicolas", "status": "prof", "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["BDD","NoSQL"], "hobbies": ["BZH", "Star Wars"]}
Keys
documents
{ "_id": "Gaël", "status" : "resp formation "employer": { "name": "ESILV", "address":"12 avenue Léonard de Vinci", "city": "Paris La Défense" }, "spec": ["Signal Processing", "Visualization"], "hobbies": ["Tourism", "Music"]}
{ "_id": "Philippe", "status": "prof", "employer": { "name": "Cnam", "address":"2 rue Conté", "city": "Paris" }, "spec": ["BDD", "NoSQL", "Music"]}
{ "_id": "Céline", "status": "prof", "employer": { "name": "CentraleSupelec", "address":"rue Joliot-Cury", "city": "Gif-sur-Yvette » }, "spec": ["Ontology", "Formal Logic", "Visualization"]}
Manipulations on documents content• Establishment (employer.name) of professors (status) specialized in BDD (in spec)
Document-Oriented StoresQueries
33
• CMS: Content Management Systems• Online libraries• Products management• Software logging• Metadata on multimedia stores
• Collection of complex events• Emails storage• Social networks logging
Document-Oriented StoresUse cases
Richness of queries Manage objects
Document-Oriented StoresSolutions
35
Storage: nodes, relations and properties• Graph Theory• Pattern queries on the graph• Data are loaded on demand• Difficulties for data modeling
Graph-Oriented Stores
status employer spec
prof ESILV BDD, NoSQLresp
formation ESILV Signal Processing, Visualization
prof CNAM BDD, NoSQL, Music
prof CentraleSupelec Ontology, Formal Logic, Visualization
id
Nicolas
Gaël
Philippe
Céline
hobbies
BZH, Star Wars
Tourism, Music
Graph-Oriented StoresExample
37
Nicolas PhilippeGaël CélineNicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
Nicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
CentraleSupelecCnamESILV
Nicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
CentraleSupelecCnam
employer employer employeremployer
ESILV
Nicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
CentraleSupelecCnam
employer
exemployer
vacation employer employeremployer
ESILV
Nicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
CentraleSupelec
Paris Gif-sur-Yvette
Cnam
employer
exemployer
vacation employer employer
location location location
employer
ESILV
Nicolastype:prof
Philippetype:prof
Gaëltype:resp
Célinetype:prof
CentraleSupelec
Paris Gif-sur-Yvette
Cnam
employer
exemployer
vacation employer employer
location location location
employer
ESILV
Pattern queries• Persons who had 2 employers at Paris
Graph-Oriented StoresQueries
38
• Pattern recognition• Recommandation• Fraud detection• SIG (road network, electricty, reachability)
• Graph computations• Shorted path• PageRank• Connexity• Communities detection
• Linked data
Graph-Oriented StoresUse cases
FlockDB
Network Recommandation
Graph-Oriented StoresSolutions
40
NoSQL Community Impact
https://db-engines.com/en/ranking
3 main properties for distributed management1. Consistency:
• A data returns the proper value at any time (coherency)
2. Availability:• Even if a server is down, data is
available3. Partition Tolerance:
• Even if the system is partitioned, a query must have an answer (unless for global failures)
Theorem: A distributed, networked system can have only two of these three properties.
The CAP Theorem[Brewer 2000]
Consistency
Availability
Partition Tolerance
CA AP
CP
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v1 v1 v1 v1v1
écriture
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v1 v1 v1v1
L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v1 v1 v1v1
v2 v2L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v1 v1 v1v1
L1écriture L2 écriture
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v2v1 v1 v1v1
L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v2v1 v1 v1v1
L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
v2v1 v2v1 v1 v1v1
attenteL1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
v2v1 v2v1 v2v1 v1v1
attente
ack
L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
v2v1 v2v1 v2v1 v1v1
v2 v2L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
v2v1 v2v1 v2v1 v1v1
L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
écriture
v2v1 v2v1 v2v1 v2v1 v1
L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
L1écriture L2
v2v1 v2v1 v2v1 v2v1 v1
v2 v1L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
L1écriture L2
asynchrone
v2v1 v2v1 v2v1 v2v1 v1
v2 v1L1écriture L2 L1écriture L2
CACohérence + Disponibilité
CPCohérence + Distribution
APDisponibilité + Distribution
synchrone
L1écriture L2
asynchrone
v2v1 v2v1 v2v1 v2v1 v2v1
v2 v1
The CAP TheoremIllustration
43
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
OracleMySQL
SQLServer
DB2PostgreSQL
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
RedisMemcached
CosmosDB
SimpleDB OracleMySQL
SQLServer
DB2PostgreSQL
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
RedisMemcached
CosmosDB
SimpleDB
BigTableHBase
ElasticsearchSpark
OracleMySQL
SQLServer
DB2PostgreSQL
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
MongoDB
RedisMemcached
CosmosDB
SimpleDB
BigTableHBase
ElasticsearchSpark
OracleMySQL
SQLServer
DB2PostgreSQL
CouchBase DynamoDB
Cassandra
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
MongoDB
RedisMemcached
CosmosDB
SimpleDB
BigTableHBase
ElasticsearchSpark
OracleMySQL
SQLServer
DB2PostgreSQL
CouchBase DynamoDB
Cassandra
Neo4jOrientDBFlockDB
Modèle
Availability(Disponibilité)
Consistency(Cohérence)
Partition tolerance(Distribution)
RelationnelOrienté Clé-ValeurOrienté ColonneOrienté DocumentOrienté Graphe
MongoDB*
RedisMemcached*
CosmosDB*
SimpleDB*
BigTableHBase
ElasticsearchSpark
OracleMySQL
SQLServer
DB2PostgreSQL
CouchBase*DynamoDB*
Cassandra*
Neo4jOrientDBFlockDB
* Possibilité de changer le mode de cohérence
Cohérence <-> Disponibilité
Modèle
The CAP TheoremCAP triangle
44
Data synchronization• PAC
• Partitioning/Data movements• Between Availability and Consistency
• ELC• Normal state of the network• Between Latency and Consistency
CAP vs PACELC[Abadi 2012]
• PA/EL• Dynamo, Cassandra, Riak• Always available with a minimum of latency
• PC/EC• VoltDB/H-Store, MegaStore• Consistency whatever happens
• PA/EC• MongoDB• While partitioning: availability• In normal state: consistency
Initially XML used for complex internet communications (Web Services)
• Too verboseJSON (JavaScript Object Notation)
• Lightweight, text-oriented, language independent
• Used for several Web services (Google API, Twitter API)
JSON: Javascript Object Notation
JSON: Javascript Object NotationKey-values & Typing
Key + Value• “lastname” : “Travers”• Keys with quotationsIdentifiers• “ _id ” commonly used• Overwrite already stored ids• Can be automaticaly generated
• Ex MongoDB: "_id" : ObjectId(1234567890)
Objects/Documents• Collection of key/values• { “_id” : 1234,
“lastname” : “Travers”,“firstname” : “Nicolas”,“kind” : 1}
Scalar• String, Integer, float, boolean, null…Documents• Objects {…}List• Arrays [ … ]• No typing
“lessons” : [ “SQL”, 1, 4.2, null, “NoSQL” ]• Can nest documents
“doc” : [ {“test” : 1},{“test” : {“nesting” : 1.0}},{“key” : “text”, “value” : null} ]
{“_id” : 1234,“lastname” : “Travers”, “firstname” : “Nicolas”,“employers” : [
{ “company” : “ESILV”, "starting_date" : “2018-09",“location” : {“street” : “12 avenue Léonard de Vinci”,“city” : “Paris La Défense”, “zip” : 92916},
{ “company” : “CNAM”, "starting_date" : “2007-09", "end_date" : “2018-08",
“location” : {“street” : “2 rue Conté”,“city” : “Paris”, “zip” : 75141},{ “company” : “UCP”, "starting_date" : “2006-09", "end_date" : “2007-08",
“location” : {“street” : “2 rue Conté”,“city” : “Cergy-Pontoise”, “zip” : 91000},{ “company” : “UVSQ”, "starting_date" : “2004-01", "end_date" : “2006-08",
“location” : {“street” : “45 avenue des Etats-Unis”,“city” : “Versailles”, “zip” : 78000},]
},“fields” : [ “DB” , “DB optimization”, “XML”, “NoSQL”, “IR” ],"hobbies" : ["Star Wars", "BZH"]
}
JSON: Javascript Object NotationComplete example