Introduction to NoSQL
N. Travers, CEDRIC Lab - Vertigo
Nicolas Travers, CNAM – France
Schedule & Organization
• Introduction to NoSQL databases
▫ 3V, ACID vs BASE, families, CAP theorem, JSON
• Presentation of MongoDB
▫ Language, distribution, replication, application
• Practical work on MongoDB
▫ Queries: find + aggregate
DBMS vs NoSQL
Context
• Applications and Web platforms
▫ Exponential growth of the amount of data (x2 every 2 years)
▫ Unprecedented volume to manage
▪ Need to distribute both computation and data
▪ Huge number of servers
▪ Heterogeneous data, possibly complex and often linked
• Ex:
▫ Google, Amazon, Facebook
▫ Google data center:
▪ 5000 servers/data center, ~1M servers
▫ Facebook:
▪ 1 petabyte of data
BI: Traditional methods
Decisional vs 3V
The classical approach is incompatible with the 3V of Big Data:
• Volume: designed to store GB/TB of data, but PB (maybe EB) are now needed.
• Variety: heterogeneous and variable types of data (text, semi-structured)
• Velocity: data are produced faster and faster
DBMS: Limitations
• Standard databases
▫ Functionalities
▪ Joins between tables
▪ Complex queries
▪ Strong consistency of data
➢ Requirements in a distributed context
▫ Links between entities => same server
▫ More links => data are harder to organize
DBMS: Limitations (2)
• ACID properties for transactions
▫ A transaction is a set of operations
▫ Atomicity (completed entirely or not at all)
▫ Consistency (the database is consistent at the start and at the end)
▫ Isolation (concurrent transactions do not interfere with each other)
▫ Durability (once committed, the result of a transaction is not lost)
▫ Pessimistic view on consistency
➢ Requirements in a distributed context
▫ These properties are difficult to ensure
▫ They conflict with efficiency/performance
ACID vs BASE
• Modern systems use the BASE model
▫ Optimistic view on consistency
▫ Basically Available:
▪ Any request => an answer
▪ Even in a changing state
▫ Soft State:
▪ Opposite to Durability.
▪ The system's state (servers or data) may change over time (even without any update)
▫ Eventually consistent:
▪ Over time, the data become consistent
▪ Updates have to be propagated
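The BASE behaviour described above can be illustrated with a minimal sketch. It assumes three in-memory replicas and a last-writer-wins rule based on a version number; the names `Replica` and `propagate` are hypothetical and not taken from any real system.

```python
class Replica:
    """One copy of the data: a value plus the version that produced it."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0

    def write(self, value, version):
        # Keep only the newest version (last-writer-wins, an assumption of this sketch).
        if version > self.version:
            self.value, self.version = value, version

    def read(self):
        # Basically available: always answers, possibly with stale data.
        return self.value


def propagate(replicas):
    """Push the newest known version to every replica."""
    newest = max(replicas, key=lambda r: r.version)
    for replica in replicas:
        replica.write(newest.value, newest.version)


replicas = [Replica("a"), Replica("b"), Replica("c")]
replicas[0].write("v1", version=1)           # update accepted by one replica only
stale_reads = [r.read() for r in replicas]   # replicas disagree for a while
propagate(replicas)                          # updates are propagated
fresh_reads = [r.read() for r in replicas]   # all replicas now agree
```

Before propagation, a read may return an old value; after propagation, every replica returns the same one.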
Solution: NoSQL
• NoSQL: Not Only SQL
▫ New data storage/management approach
▫ Scales the system out (through distribution)
▫ Complex metadata management
▪ No schema
➢ Does not replace the DBMS; dedicated to:
▫ Very large volumes of data (petabytes)
▫ Very short response times
▫ Contexts where consistency is not mandatory
Databases and NoSQL
NoSQL DB: Characteristics
• No relations => Collections
▫ No fixed structure (or even none)
• Complex data (e.g. documents)
▫ Objects, nesting, arrays
• Data distribution
▫ High parallelization (Map/Reduce)
• Data replication
▫ Availability vs Consistency (no transactions)
▫ Few writes, many reads
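To make the Map/Reduce idea concrete, here is a minimal single-process sketch. It assumes a collection split into two shards and counts how often each value of a `fields` array occurs; the function names and the sample documents are illustrative only, and a real framework would run the map and reduce phases on different servers.

```python
from collections import defaultdict


def map_phase(doc):
    """Emit one (key, 1) pair per field listed in the document."""
    for field in doc["fields"]:
        yield field, 1


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate the values emitted for one key."""
    return key, sum(values)


# Two shards, as if the collection were distributed over two servers.
shards = [
    [{"fields": ["DB", "NoSQL"]}, {"fields": ["XML"]}],
    [{"fields": ["NoSQL", "IR"]}],
]

pairs = [pair for shard in shards for doc in shard for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each document is mapped independently of the others, which is what allows the work to be parallelized over the shards.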
Sharding: Scalability
• Data blocks are distributed over a cluster of servers
• Horizontal partitioning
• 3 types of techniques:
1. Resource allocation based: HDFS
2. Tree-based structure: clustered index (sort)
3. Hash-based structure: consistent hashing
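The hash-based technique can be sketched as follows. This is a simplified consistent hashing ring, assuming MD5 as the hash function and 100 virtual nodes per server; the class name, server names and key format are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        # Each server owns several points ("virtual nodes") on the ring.
        self._ring = sorted(
            (ring_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [position for position, _ in self._ring]

    def server_for(self, key: str) -> str:
        # A key belongs to the first server point found clockwise from its position.
        index = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
before = ConsistentHashRing(["s1", "s2", "s3"])
after = ConsistentHashRing(["s1", "s2", "s3", "s4"])

# Keys whose server changed when s4 joined the cluster.
moved = [k for k in keys if before.server_for(k) != after.server_for(k)]
```

When a server is added, only the keys taken over by the new server move; all other keys stay where they were, which is the point of the technique.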
NoSQL Systems
Several NoSQL systems
• Key-Value Store
▫ Data are identified by a unique key (used for querying)
▫ DynamoDB, Voldemort, Redis, Riak, MemcacheDB
• Column data
▫ 1-n "one-to-many" relations (messages, posts)
▫ HBase, Hypertable, Cassandra
• Documents
▫ Complex data, attributes/values
▫ MongoDB, CouchDB, Terrastore, Elasticsearch
• Graphs
▫ Highly connected entities, social networking
▫ Neo4j, OrientDB, FlockDB
I - NoSQL & Key-Value store
• Similar to a distributed "HashMap"
• Key + Value
▫ No fixed schema on values (strings, objects, integers, binaries…)
• Drawbacks:
▫ No structure or typing
▫ No structure-based queries
➢ DynamoDB (Amazon), Redis (VMware), Voldemort (LinkedIn)
I - NoSQL & Key-Value store (2)
• CRUD operations (HTTP)
▫ Create(key, value)
▫ Read(key)
▫ Update(key, value)
▫ Delete(key)
• Horizontal scaling (partitioning/distribution)
• No vertical distribution (data segmentation)
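The four CRUD operations can be sketched with an in-memory dictionary. This is only an illustration: the class `KeyValueStore` is hypothetical, and the HTTP layer and the distribution over servers are left out.

```python
class KeyValueStore:
    """In-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def read(self, key):
        # Access is by key only: there is no query on the content of the value.
        return self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"unknown key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]


store = KeyValueStore()
store.create("user:42", {"lastname": "Travers"})
store.update("user:42", {"lastname": "Travers", "firstname": "Nicolas"})
value = store.read("user:42")
store.delete("user:42")
```

Since every operation goes through the key, finding all users with a given first name would require reading every value, which is the drawback noted for this family.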
I - NoSQL & Key-Value store (3)
II - NoSQL & Columns
• Column-based storage
▫ A DBMS stores tuples (rows) instead
• Easy to insert a new column
▫ Dynamic schema
➢ BigTable (Google), HBase (Apache), Cassandra (Facebook & Apache), SimpleDB (Amazon)
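The difference between row-based and column-based storage can be sketched in a few lines. The sample rows are made up for illustration and `to_columns` is a hypothetical helper; real column stores add compression, column families and distribution on top of this idea.

```python
# Row-based layout: one record per tuple, as in a classical DBMS.
rows = [
    {"id": 1, "lastname": "Travers", "city": "Paris"},
    {"id": 2, "lastname": "Doe", "city": "Lyon"},
]


def to_columns(rows):
    """Turn a list of rows into one list of values per column."""
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return {name: [row.get(name) for row in rows] for name in names}


columns = to_columns(rows)

# Reading one attribute only touches one column.
cities = columns["city"]

# Adding a column does not rewrite the existing ones (dynamic schema).
columns["zip"] = [75141, None]
```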
II - NoSQL & Columns (2)
• Advantages:
▫ XML/JSON support
▫ Column indexing
▫ Horizontal scaling
• Drawbacks:
▫ Hard to query complex data
▫ Difficult for linked data (distances, paths, time)
▫ Pre-defined queries (not on the fly)
II - NoSQL & Columns (3)
III - NoSQL & Documents
• Based on the key-value store
▫ Adds semi-structured data (JSON/XML)
• HTTP API
▫ More complex than CRUD
➢ MongoDB, CouchDB (Apache), RavenDB, Terrastore
III - NoSQL & Documents (2)
• Document management
▫ Simple types (Int, String, Date)
▫ No fixed schema (documents may vary)
▫ Nested data
• Advantages:
▫ Rich queries
▫ Indexing on several attributes
▫ Easy to scale
• Drawbacks:
▫ Hard to interconnect data
▫ Tied to key-value access (id)
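Unlike a key-value store, a document store can query inside the value. The sketch below imitates this with plain Python dictionaries; `get_path` and `find` are hypothetical helpers, not the API of MongoDB or of any other system, and the second document is made up to show that documents of one collection may have different structures.

```python
def get_path(doc, path):
    """Follow a dotted path such as "work.location.city" through nested documents."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the attribute does not exist in this document
        current = current[part]
    return current


def find(collection, path, expected):
    """Return the documents whose attribute at `path` equals `expected`."""
    return [doc for doc in collection if get_path(doc, path) == expected]


collection = [
    {"_id": 1, "lastname": "Travers",
     "work": {"company": "Cnam", "location": {"city": "Paris"}}},
    {"_id": 2, "lastname": "Doe", "fields": ["DB"]},  # no "work" attribute at all
]

in_paris = find(collection, "work.location.city", "Paris")
```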
III - NoSQL & Documents (3)
IV - NoSQL & Graph
• Storage: nodes, relations and properties
▫ Graph theory
▫ Path querying on the graph
▫ Data are loaded on demand
▫ Difficult to model
➢ Neo4j, OrientDB (Apache), FlockDB (Twitter)
IV - NoSQL & Graph (2)
• Stored elements
▫ Objects (cf. documents)
▫ Edges (with properties)
• Difficult to shard
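A path query over nodes and edges with properties can be sketched with a breadth-first search. The graph below is made-up sample data and `shortest_path` is a hypothetical helper; graph databases expose such traversals through their own query languages.

```python
from collections import deque

# Nodes are objects (cf. documents); edges carry their own properties.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "carol": {"kind": "person"},
    "dave": {"kind": "person"},
}
edges = {
    "alice": [("bob", {"type": "follows"})],
    "bob": [("carol", {"type": "follows"})],
    "carol": [],
    "dave": [("alice", {"type": "follows"})],
}


def shortest_path(edges, start, goal):
    """Breadth-first search: returns the shortest list of nodes from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor, _properties in edges.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path between the two nodes


path = shortest_path(edges, "alice", "carol")
no_path = shortest_path(edges, "carol", "alice")
```

The traversal jumps from node to node along the edges, which is why splitting the nodes over several servers (sharding) is difficult: a single path may cross many servers.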
IV - NoSQL & Graph (3)
Brewer's CAP Theorem (2000)
• 3 main properties for distributed data management
1. Consistency:
▪ A piece of data has the same value everywhere at the same time (coherency)
2. Availability:
▪ Even if a server is down, data remain available
3. Partition Tolerance:
▪ Even if the system is partitioned, a query must get an answer (except in case of global failure)
➢ Theorem: a distributed, networked system can guarantee only two of these three properties at the same time.
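The trade-off can be illustrated with a deliberately simplified sketch: two replicas, a simulated network partition, and a cluster that either refuses writes (favouring consistency) or accepts them locally (favouring availability). The `Cluster` class and the "CP"/"AP" labels are assumptions of this sketch, not a model of any real system.

```python
class Cluster:
    """Two replicas of a single value, with a simulated network partition."""

    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP" (labels used in this sketch)
        self.replicas = {"n1": "v0", "n2": "v0"}
        self.partitioned = False      # True when n1 and n2 cannot communicate

    def write(self, node, value):
        if not self.partitioned:
            for name in self.replicas:        # normal case: update every replica
                self.replicas[name] = value
        elif self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            raise RuntimeError("write refused: other replica unreachable")
        else:
            # Availability first: accept locally, replicas temporarily diverge.
            self.replicas[node] = value

    def read(self, node):
        return self.replicas[node]


ap = Cluster(mode="AP")
ap.partitioned = True
ap.write("n1", "v1")
ap_reads = (ap.read("n1"), ap.read("n2"))     # available, but inconsistent

cp = Cluster(mode="CP")
cp.partitioned = True
try:
    cp.write("n1", "v1")
    cp_accepted = True
except RuntimeError:
    cp_accepted = False                       # consistent, but not available
cp_reads = (cp.read("n1"), cp.read("n2"))
```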
• XML was initially used for complex Internet communications (Web Services)
▫ Too verbose
• JSON (JavaScript Object Notation)
▫ Lightweight, text-oriented, language independent
▫ Used by several Web services (Google API, Twitter API)
JSON: Structures
• Key + Value
▫ "lastname" : "Travers"
▫ Keys are written in double quotes
• Objects/documents
▫ Collection of key/value pairs
▫ { "lastname" : "Travers",
"firstname" : "Nicolas",
"kind" : 1}
Data types
• Scalar : String, Integer, float, boolean, null…
• List : arrays [ … ]
• Documents : objects {…}
Arrays
• Arrays are not typed: values of different types can be mixed
▫ "lessons" : ["SQL", 1, 4.2, null, "NoSQL"]
• Can nest documents
▫ "doc" : [ {"test" : 1},
{"test" : {"nesting" : 1.0}},
{"key" : "text", "value" : null}]
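As a quick check of these two rules, the arrays above can be parsed with Python's standard `json` module. The enclosing `{ … }` is added here only because a key/value pair must live inside an object to form a complete JSON text.

```python
import json

# Values of different types in the same array.
lessons = json.loads('{"lessons": ["SQL", 1, 4.2, null, "NoSQL"]}')["lessons"]
types = [type(item).__name__ for item in lessons]

# Documents nested inside an array, and a document nested inside a document.
text = """
{"doc": [ {"test": 1},
          {"test": {"nesting": 1.0}},
          {"key": "text", "value": null} ]}
"""
doc = json.loads(text)["doc"]
nested_value = doc[1]["test"]["nesting"]
```

JSON `null` is mapped to Python's `None`, and each element keeps its own type.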
JSON: Identifiers
• Key "_id" commonly used to identify documents
▫ Storing a document under an existing id overwrites the stored one
▫ Can be automatically generated
▪ Ex MongoDB: "_id" : ObjectId(1234567890)
JSON: complete example
{
  "_id" : 1234,
  "lastname" : "Travers", "firstname" : "Nicolas",
  "work" : {
    "company" : "Cnam",
    "location" : {
      "street" : "2 rue conté",
      "city" : "Paris",
      "zip" : 75141
    }
  },
  "fields" : [ "DB" , "DB tuning", "XML", "NoSQL", "IR" ]
}
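A document like this one can be loaded and navigated with Python's standard `json` module, as in the short sketch below. Note that strict JSON requires straight double quotes and no comma after the last member of an object; the final lines show that the parser rejects a trailing comma.

```python
import json

text = """
{
  "_id": 1234,
  "lastname": "Travers", "firstname": "Nicolas",
  "work": {
    "company": "Cnam",
    "location": {"street": "2 rue conté", "city": "Paris", "zip": 75141}
  },
  "fields": ["DB", "DB tuning", "XML", "NoSQL", "IR"]
}
"""

person = json.loads(text)

# Nested documents are reached key by key, arrays by position.
city = person["work"]["location"]["city"]
first_field = person["fields"][0]

# A comma after the last member of an object is not valid JSON.
try:
    json.loads('{"a": {"b": 1},}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
```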