1
L24: Course Evaluation & Review
CS3200 Database design (sp18 s2), https://course.ccs.neu.edu/cs3200sp18s2/, 4/12/2018
3
Lectures: 1st half - from a user’s perspective
1. SQL: Relational data models & Queries
- ~5 lectures -> 6 lectures
- How to manipulate data with SQL, a declarative language
• reduced expressive power but the system can do more for you
2. Database Design: Design theory and constraints
- ~6 lectures -> 7 lectures
- Designing relational schema to keep your data from getting corrupted
3. Transactions: Syntax & supporting systems
- ~3 lectures -> 2 lectures
- A programmer’s abstraction for data consistency
4
Lectures: 2nd half - understanding how it works
4. Database internals: Query Processing
- ~7 lectures -> 5.5 lectures
- Indexing
- External Memory Algorithms (IO model) for sorting, joins, etc.
- Basics of query optimization (Cost Estimates)
- Relational algebra
5. NoSQL
- ~0-2 lectures -> 1.5 lectures
- Key-Value Stores
- (More in CS6240: Large-Scale Parallel Data Processing)
5
Studying material: "Under which study condition do you think you learn better?"
Source: Karpicke & Blunt, "Retrieval Practice Produces More Learning than Elaborative Studying with Concept Mapping," Science, 2011.
[Figure: judged performance (= what people think) vs. actual performance (= what is actually working), for passive reading and active Q&A]
6
Sequencing Material: "Under which teaching condition do you think you learn better?"
Source: Bjork & Bjork, "Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning," Psychology and the real world (...), 2011.
7
My pedagogic goals for classroom effectiveness
Goal: Increased learning | Fair assessment
Metric: Δ learning / time invested ratio | signal / noise ratio
Implications: minimize chores, have group HWs, "soft" graded HWs, no attendance check, in-class problems, class contributions, interleaved, discuss student solutions, ... | exam: hard, comprehensive, individual, time-constrained
Risks: "Slacking off" | Stress, "not fun"
8
My original Grading Philosophy (and I'll go back to it next year)
• no fixed percentages (e.g., 30% for A)
• no fixed cut-offs (e.g., 80/100 for A)
• cut-off points depend on overall class interactivity as compared to other years
I will not disclose the actual cut-off points. Don't ask for an exception.
[Figure: actual point distribution from a past final exam (long, but fair!): total points (0% to 100%) over the student population, with the regions for grades A, B, and C]
9
Ideas for next year
• No project, but more, longer homeworks in larger groups
- keep soft graded HWs, hard exams
• Gradescope, some autograded Jupyter notebook checks for code
• Topics:
- 1 SQL: no change
- 2 Database design: shorten and completely replace the Stanford arrow notation with crow foot / UML; personalize Gradiance
- 3 Transactions: extend and include hands-on exercises
- 4 Database internals: shorten; remove advanced joins, but keep indices and RA
- 5 NoSQL: extend with hands-on with all 4 types of NoSQL databases
• To be decided: all in Jupyter, or actual installations, or preinstalled on VMs
10
Reminder: Faculty Course Evaluations
• Please take the next 10 minutes to complete your Faculty Course Evaluation ("TRACE") for this course.
• Your feedback:
- Helps me improve the course
- Helps your fellow students make better decisions about courses and professors
- Is anonymous: I get a report with results and comments 2-3 weeks after grades are in
- Should only take 10 minutes to complete
- Written comments are especially valuable
11
Thanks for leaving *detailed* Feedback
1. Topics most interesting to you (and why): 1 SQL, 2 DB design, 3 Transactions, 4 Internals, 5 NoSQL: more or less material? / what part most difficult / slower or faster?
2. Class organization / Website: did you find what you were looking for? / what was difficult to find or follow? What would have helped? Suggestions for Blackboard or website or Piazza or Gradescope?
3. Installing software: what was difficult to install? What would have helped in addition to the provided PDFs (installation videos?) Replace PostgreSQL with MySQL (PostgreSQL is better for optimizing, but we won't cover this in detail next time)? Virtual machines?
4. Jupyter notebooks: what went well or wrong? how to improve?
5. What aspects helped you learn and not forget: more cold calling / group exercises / short slide exercises (SQL) / hands-on SQL typing / FMs on homework solutions / Office Hours / TA Office Hours
6. Learning material: SQL on slides vs. SQL typing / slides / textbooks / other resources
7. Use of computers & social media in class: yes / no
8. HW & group projects: working in groups / assigned random groups / detail of feedback in HW solutions / peer evaluation / slacking off vs. learning
9. Assessment & cheating: more British style: homeworks = practice / final = test
10. Best practice from other classes / what to copy *to* other classes. Other ways I can help: office hours / anonymous feedback form / 5 min breaks / 5 min "social breaks" where I assign you to talk to somebody
11. How to make you engage more actively? SQL worked really well. More random calling from class list?
12. …
12
Review
• A quick tour d'horizon through the 5 topics we discussed in class
• For final exam: everything that was covered in class:
- slides, homeworks, solutions, discussions, Piazza, Gradiance, Jupyter
14
General form of SQL Query
SELECT S
FROM R1,…,Rn
WHERE C1
GROUP BY a1,…,ak
HAVING C2
ORDER BY S2
C1: is any condition on the attributes in R1,…,Rn
C2: is any condition on aggregates and on attributes a1,…,ak
S: may contain attributes a1,…,ak and/or any aggregates but no other attributes
Evaluation
1. Evaluate FROM
2. WHERE, apply condition C1
3. GROUP BY the attributes a1,…,ak
4. Apply condition C2 to each group (may have aggregates)
5. Compute aggregates in S and return the result
6. Sort rows by ORDER BY clause
The logical order is useful for understanding, but not always correct. The ANSI SQL standard does not require a specific processing order and leaves that to the implementation. Recall our intro example with SELECT DISTINCT and ORDER BY! Notice that that example can't be explained with the order shown here.
15
From → Where → Group By → Select
SELECT product, sum(quantity) as TotalSales
FROM Purchase
WHERE price > 1
GROUP BY product
Purchase:
Product Price Quantity
Bagel 3 20
Bagel 2 20
Banana 1 50
Banana 2 10
Banana 4 10
Result:
Product TotalSales
Bagel 40
Banana 20
Select contains
• grouped attributes
• and aggregates
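The query above can be tried out with a minimal sketch in Python's built-in `sqlite3` module. The column types and the trailing ORDER BY (added only to make the output order deterministic) are assumptions, not part of the slide.

```python
import sqlite3

# In-memory database holding the Purchase table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (product TEXT, price INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [("Bagel", 3, 20), ("Bagel", 2, 20),
     ("Banana", 1, 50), ("Banana", 2, 10), ("Banana", 4, 10)],
)

# FROM -> WHERE -> GROUP BY -> SELECT: the row (Banana, 1, 50) is filtered
# out by WHERE before the groups are formed.
total_sales = conn.execute(
    """
    SELECT product, sum(quantity) AS TotalSales
    FROM Purchase
    WHERE price > 1
    GROUP BY product
    ORDER BY product
    """
).fetchall()
print(total_sales)
```

The Banana total is 20 rather than 70 because the WHERE filter runs before grouping.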
16
Let's confuse the database engine
SELECT product, quantity
FROM Purchase
GROUP BY product
Purchase:
Product Price Quantity
Bagel 3 20
Bagel 2 20
Banana 1 50
Banana 2 10
Banana 4 10
Result:
Product Quantity
Bagel ?
Banana ?
What quantity should the DB return for Banana?
The DB engine is confused: there is no single quantity for banana (it's an ill-defined query). It should thus return an error (only SQLite misbehaves and returns something, which makes no sense). Please think this through carefully!
17
Don't use new Alias in HAVING clause
Purchase:
Product Price Quantity
Bagel 3 20
Bagel 2 20
Banana 1 50
Banana 2 10
Banana 4 10
SELECT product, sum(quantity) as SumQ
FROM Purchase
WHERE quantity > 15
GROUP BY product
HAVING SumQ > 35
What does this query return over the given database?
Product SumQ
Bagel 40
Banana 50
Error in SQL server! Reason: HAVING is evaluated before SELECT! (However, SQLite works: different implementation)
Source: http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause
18
Don't use new Alias in HAVING clause
Purchase:
Product Price Quantity
Bagel 3 20
Bagel 2 20
Banana 1 50
Banana 2 10
Banana 4 10
SELECT product, sum(quantity) as SumQ
FROM Purchase
WHERE quantity > 15
GROUP BY product
HAVING sum(quantity) > 35
ORDER BY sumQ desc
What does this query return over the given database?
Product SumQ
Banana 50
Bagel 40
Works! Notice the new sorting.
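As a small check of the portable variant (aggregate repeated in HAVING, alias used only in ORDER BY), here is a sketch using Python's `sqlite3`; the column types are assumptions. Since SQLite also accepts the alias in HAVING, this sketch cannot reproduce the error seen in other systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (product TEXT, price INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [("Bagel", 3, 20), ("Bagel", 2, 20),
     ("Banana", 1, 50), ("Banana", 2, 10), ("Banana", 4, 10)],
)

# HAVING repeats the aggregate instead of the alias; ORDER BY may use the
# alias because sorting happens after SELECT in the logical order.
rows = conn.execute(
    """
    SELECT product, sum(quantity) AS SumQ
    FROM Purchase
    WHERE quantity > 15
    GROUP BY product
    HAVING sum(quantity) > 35
    ORDER BY SumQ DESC
    """
).fetchall()
print(rows)
```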
19
Illustration: an "inner join"
English:
eText eid
One 1
Two 2
Three 3
Four 4
Five 5
Six 6
French:
fid fText
1 Un
3 Trois
4 Quatre
5 Cinq
6 Six
7 Sept
8 Huit
SELECT *
FROM English, French
WHERE eid = fid
Same as:
SELECT *
FROM English JOIN French
ON eid = fid
("JOIN" same as "INNER JOIN")
Result:
etext eid fid ftext
One 1 1 Un
Three 3 3 Trois
Four 4 4 Quatre
Five 5 5 Cinq
Six 6 6 Six
20
Illustration: a "full outer join"
English:
eText eid
One 1
Two 2
Three 3
Four 4
Five 5
Six 6
French:
fid fText
1 Un
3 Trois
4 Quatre
5 Cinq
6 Six
7 Sept
8 Huit
SELECT *
FROM English FULL JOIN French
ON English.eid = French.fid
("FULL JOIN" same as "FULL OUTER JOIN")
Result:
etext eid fid ftext
One 1 1 Un
Two 2 NULL NULL
Three 3 3 Trois
Four 4 4 Quatre
Five 5 5 Cinq
Six 6 6 Six
NULL NULL 7 Sept
NULL NULL 8 Huit
SQLite did not support "FULL OUTER JOIN"s at the time of this lecture (but "LEFT JOIN"); SQLite 3.39.0 and later do support them.
For comparison, the inner join:
SELECT *
FROM English JOIN French
ON eid = fid
21
Illustration
[Figure: Venn diagram of the ids in English and French: 2 occurs only in English; 1, 3, 4-6 occur in both; 7, 8 occur only in French. The intersection = (INNER) JOIN; the whole diagram = FULL (OUTER) JOIN]
Source: Fig. 7-2, Hoffer et al., Modern Database Management, 10th ed., 2011.
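The two join types can be compared with a short sketch in Python's `sqlite3`. Because older SQLite versions lack FULL OUTER JOIN, the full join is emulated here with a LEFT JOIN plus the unmatched French rows; this emulation is an assumption of the sketch, not something shown on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE English (eText TEXT, eid INTEGER)")
conn.execute("CREATE TABLE French (fid INTEGER, fText TEXT)")
conn.executemany(
    "INSERT INTO English VALUES (?, ?)",
    [("One", 1), ("Two", 2), ("Three", 3), ("Four", 4), ("Five", 5), ("Six", 6)],
)
conn.executemany(
    "INSERT INTO French VALUES (?, ?)",
    [(1, "Un"), (3, "Trois"), (4, "Quatre"), (5, "Cinq"),
     (6, "Six"), (7, "Sept"), (8, "Huit")],
)

# Inner join: only ids present in both tables (1, 3, 4, 5, 6).
inner = conn.execute(
    "SELECT eText, eid, fid, fText FROM English JOIN French ON eid = fid"
).fetchall()

# Emulated full outer join: every English row (LEFT JOIN), plus the French
# rows that found no English partner.
full = conn.execute(
    """
    SELECT eText, eid, fid, fText FROM English LEFT JOIN French ON eid = fid
    UNION ALL
    SELECT eText, eid, fid, fText FROM French LEFT JOIN English ON eid = fid
    WHERE eid IS NULL
    """
).fetchall()

print(len(inner), len(full))
```

The three extra rows of the full join are exactly the ids outside the intersection: 2 (English only) and 7, 8 (French only).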
22
Empty Group Problem
Item(name, category)
Purchase2(iName, store, month)
Compute, for each product, the total number of sales in Sept (= month 9)
SELECT name, count(*)
FROM Item, Purchase2
WHERE name = iName
and month = 9
GROUP BY name
What’s wrong?
23
Empty Group Problem
Item(name, category)
Purchase2(iName, store, month)
Compute, for each product, the total number of sales in Sept (= month 9)
SELECT name, count(store)
FROM Item LEFT JOIN Purchase2 ON
name = iName
and month = 9
GROUP BY name
Now we also get the products with 0 sales
We need to use an attribute from "Purchase2" to get the correct 0 count. Try "name" from "Item".
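A minimal sketch of the empty group problem in Python's `sqlite3`. The rows below are hypothetical (the slides give only the schemas): "Bagel" is sold twice in September, "Banana" only in August.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (name TEXT, category TEXT)")
conn.execute("CREATE TABLE Purchase2 (iName TEXT, store TEXT, month INTEGER)")
# Hypothetical example data, not from the slides.
conn.executemany("INSERT INTO Item VALUES (?, ?)",
                 [("Bagel", "food"), ("Banana", "food")])
conn.executemany("INSERT INTO Purchase2 VALUES (?, ?, ?)",
                 [("Bagel", "A", 9), ("Bagel", "B", 9), ("Banana", "A", 8)])

# Inner join: products without September sales form no group at all.
inner_counts = conn.execute(
    """
    SELECT name, count(*)
    FROM Item, Purchase2
    WHERE name = iName AND month = 9
    GROUP BY name ORDER BY name
    """
).fetchall()

# Left join: every item keeps a group; count(store) ignores the NULL
# padding, so products without matching purchases get 0.
left_counts = conn.execute(
    """
    SELECT name, count(store)
    FROM Item LEFT JOIN Purchase2 ON name = iName AND month = 9
    GROUP BY name ORDER BY name
    """
).fetchall()

print(inner_counts, left_counts)
```

Note that the condition month = 9 sits in the ON clause; moving it to WHERE would filter the NULL-padded rows out again.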
25
Data modeling and Database Design Process
1. ER Diagram
Conceptual Model ("technology independent"): describe main data items
[Figure: ER diagram with entities Doctor and Patient, attributes name, zip, name, dno, and relationship patient_of]
2. Relational Database Design
Logical Model ("for relational databases"): Tables, Constraints, Functional Dependencies
Normalization: Eliminates anomalies
3. Database Implementation
Physical Model: Physical storage details
Result: Physical Schema
26
From E/R Diagrams to Relational Schema
• Key concept
• Entity sets become relations, Relationships can become relations (tables in RDBMS)
• Tables are connected with foreign key constraints
• A database schema
- A map of the tables and fields (attributes) in the database
- This is what is implemented in the database management system
- Part of the “design” process
27
Example: translate this ERD v1 into tables
[Figure: ER diagram. Customer (CustomerID, First name, Last name, City, State, Zip) makes Order (Order number, Order Date); Order contains Product (Product name, Price); the diagram also shows the attribute Quantity]
What do we do?
28
Example: translate this ERD v2 into tables
[Figure: ER diagram. Customer (CustomerID, FirstName, LastName, City, State, Zip) Makes Order (OrderNumber, OrderDate); Order Contains Product (ProductID, ProductName, Price); the diagram also shows the attribute Quantity]
What do we do?
29
Example: Our Order Database schema
Customer(CustomerID, FirstName, LastName, City, State, Zip)
Order(OrderNumber, OrderDate, CustomerID)
Order-Product(OrderProductID, OrderNumber, ProductID, Quantity)
Product(ProductID, ProductName, Price)
[Figure labels: original 1:n relationship; original n:n relationship]
• Order-Product is a decomposed many-to-many relationship
- Order-Product has a 1:n relationship with Order and Product
- Now an order can have multiple products, and a product can be associated with multiple orders
31
A) Associative Entity Relations (No Identifier)
Default primary key for the association relation is composed of the primary keys of the two entities (as in M:N relationship)
33
B) Associative Entity Relations (With Identifier)
• Identifier attribute becomes new primary key in relation
• Foreign keys reference all related entities
Do we need the key?
34
Relational Schema Design
Do you see any anomalies?
Recall set attributes (persons with several phones):
• One person may have multiple phones, but lives in only one city
• Primary key is thus (SSN, PhoneNumber)
Name SSN PhoneNumber City
Fred 123-45-6789 412-555-1234 Boston
Fred 123-45-6789 412-555-6543 Boston
Joe 987-65-4321 908-555-2121 Westfield
Employee
35
Relational Schema Design
Do you see any anomalies?
Recall set attributes (persons with several phones):
What do we do?
• One person may have multiple phones, but lives in only one city
• Primary key is thus (SSN, PhoneNumber)
Name SSN PhoneNumber City
Fred 123-45-6789 412-555-1234 Boston
Fred 123-45-6789 412-555-6543 Boston
Joe 987-65-4321 908-555-2121 Westfield
Employee
• Deletion anomalies: what if Joe deletes his phone number? (what if Joe had no phone #)
• Insert anomalies: what if Joe gets a second phone number
• Update anomalies: what if Fred moves to "New York"?
36
Relation Decomposition
Original relation Employee:
Name SSN PhoneNumber City
Fred 123-45-6789 412-555-1234 Boston
Fred 123-45-6789 412-555-6543 Boston
Joe 987-65-4321 908-555-2121 Westfield
Break the relation into two:
Employee:
Name SSN City
Fred 123-45-6789 Boston
Joe 987-65-4321 Westfield
Phone:
SSN PhoneNumber
123-45-6789 412-555-1234
123-45-6789 412-555-6543
987-65-4321 908-555-2121
Anomalies have gone:
• No more repeated data
• Easy to move Fred to "New York" (how?)
• Easy to delete all Joe's phone numbers (how?)
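One possible answer to the two "how?" questions, as a sketch in Python's `sqlite3`. The key and foreign-key declarations are assumptions made for illustration; the slide shows only the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, SSN TEXT PRIMARY KEY, City TEXT)")
conn.execute(
    """
    CREATE TABLE Phone (
        SSN TEXT REFERENCES Employee(SSN),
        PhoneNumber TEXT,
        PRIMARY KEY (SSN, PhoneNumber)
    )
    """
)
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [("Fred", "123-45-6789", "Boston"),
                  ("Joe", "987-65-4321", "Westfield")])
conn.executemany("INSERT INTO Phone VALUES (?, ?)",
                 [("123-45-6789", "412-555-1234"),
                  ("123-45-6789", "412-555-6543"),
                  ("987-65-4321", "908-555-2121")])

# Update: moving Fred touches exactly one row, however many phones he has.
cursor = conn.execute(
    "UPDATE Employee SET City = 'New York' WHERE SSN = '123-45-6789'")
updated_rows = cursor.rowcount

# Deletion: removing all of Joe's phone numbers does not remove Joe.
conn.execute("DELETE FROM Phone WHERE SSN = '987-65-4321'")
employees = conn.execute("SELECT Name, City FROM Employee ORDER BY Name").fetchall()
phones_left = conn.execute("SELECT count(*) FROM Phone").fetchone()[0]
print(updated_rows, employees, phones_left)
```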
37
Keys and Superkeys
A superkey is a set of attributes A1,…,An s.t. for any other attribute B in R, we have {A1,…,An} → B
I.e. all attributes are functionally determined by a superkey
A key is a minimal superkey (also called "candidate key")
This means that no proper subset of a key is also a superkey (i.e., dropping any attribute from the key makes it no longer a superkey)
38
Quick recap FDs
• Functional Dependency (FD): The value of one set of attributes (the determinant) uniquely determines the value of another set of attributes (the dependents)
• A superkey (SK) is a set of attributes of a relation schema upon which all attributes of the schema are functionally dependent.
• A candidate key (CK) is a non-redundant (minimal) SK
• Prime attribute: belonging to some candidate key
• Partial FD: FD in which one or more non-prime attributes are functionally dependent on part (but not all) of any CK
• Transitive FD: An FD between two (or more) nonkey attributes
• 3NF: no partial nor transitive FD
39
Boyce-Codd Normal Form (BCNF)
• Boyce-Codd normal form (BCNF)
- A relation is in BCNF, if and only if, every (non-trivial) determinant is a superkey.
• The difference between 3NF and BCNF is that for a FD A → B,
- 3NF allows this dependency in a relation if B is a primary-key attribute and A is not a candidate key,
- whereas BCNF insists that for this dependency to remain in a relation, A must be a SK (contain a CK).
43
BCNF vs 3NF
• BCNF: For every functional dependency X -> Y in a set F of functional dependencies over relation R, either:
- X is a superkey of R
- (or Y is a subset of X, thus the FD is trivial)
• 3NF: For every functional dependency X -> Y in a set F of functional dependencies over relation R, either:
- X is a superkey of R
- or Y is a subset of K for some CK K (Y is prime)
• N.b., no subset of a key is a key
- (or Y is a subset of X, thus the FD is trivial)
44
A problem with BCNF
{Unit} → {Company}
{Company, Product} → {Unit}
Original relation: (Unit, Company, Product)
We do a BCNF decomposition on a “bad” FD: {Unit}+ = {Unit, Company}
Decomposed relations: (Unit, Company) with {Unit} → {Company}, and (Unit, Product)
We lose the FD {Company, Product} → {Unit}!!
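The closure {Unit}+ used above can be computed mechanically. The following is a small sketch of the standard attribute-closure fixpoint in Python; the function names `closure` and `is_superkey` are chosen for this illustration only.

```python
def closure(attributes, fds):
    """Return the closure of `attributes` under the FDs `fds`.

    `fds` is a list of (lhs, rhs) pairs of frozensets.
    """
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD whenever its left side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attributes, all_attributes, fds):
    """A set is a superkey iff its closure contains every attribute of R."""
    return closure(attributes, fds) == set(all_attributes)


R = {"Unit", "Company", "Product"}
FDS = [
    (frozenset({"Unit"}), frozenset({"Company"})),
    (frozenset({"Company", "Product"}), frozenset({"Unit"})),
]

unit_closure = closure({"Unit"}, FDS)
print(sorted(unit_closure))
print(is_superkey({"Unit"}, R, FDS), is_superkey({"Company", "Product"}, R, FDS))
```

Since {Unit} is not a superkey, {Unit} → {Company} is the "bad" FD that triggers the BCNF decomposition, while {Company, Product} → {Unit} does not violate BCNF.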
45
So Why is that a Problem?
{Unit} → {Company}
{Company, Product} → {Unit}
Decomposed tables:
Unit Company
Galaga99 NEU
Bingo NEU
Unit Product
Galaga99 Databases
Bingo Databases
No problem so far. All local FD’s are satisfied ({Unit} → {Company}).
Let’s put all the data back into a single table again:
Unit Company Product
Galaga99 NEU Databases
Bingo NEU Databases
Violates the FD {Company, Product} → {Unit}!!
46
The Problem
• We started with a table R and FDs F
• We decomposed R into BCNF tables R1, R2, … with their own FDs F1, F2, …
• We insert some tuples into each of the relations, which satisfy their local FDs, but when we reconstruct R, it violates some FD across tables!
Practical Problem: To enforce the FD, we must reconstruct R on each insert!
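The practical problem can be made concrete with a short sketch in Python: each decomposed table passes its local FD check, but the reconstructed (joined) table does not. The helper `violates_fd` and the dictionary-based row representation are assumptions of this illustration.

```python
def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return True
    return False


unit_company = [{"Unit": "Galaga99", "Company": "NEU"},
                {"Unit": "Bingo", "Company": "NEU"}]
unit_product = [{"Unit": "Galaga99", "Product": "Databases"},
                {"Unit": "Bingo", "Product": "Databases"}]

# Local check on the decomposed table: {Unit} -> {Company} holds.
local_violation = violates_fd(unit_company, ["Unit"], ["Company"])

# Reconstruct R with a natural join on Unit, then check the lost FD.
joined = [{**uc, **up}
          for uc in unit_company
          for up in unit_product
          if uc["Unit"] == up["Unit"]]
global_violation = violates_fd(joined, ["Company", "Product"], ["Unit"])

print(local_violation, global_violation)
```

The join is exactly the extra work the slide warns about: without it, the violation of {Company, Product} → {Unit} goes unnoticed.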
48
ACID
• Atomicity
- Either all operations are applied or none are (hence, we need not worry about the effect of incomplete / failed transactions)
• Consistency
- Each transaction can start with a consistent database and is required to leave the database consistent (bring the DB from one to another consistent state)
• Isolation
- The effect of a transaction should be as if it is the only transaction in execution (in particular, changes made by other transactions are not visible until committed)
• Durability
- Once the system informs a transaction success, the effect should hold without regret, even if the database crashes (before making all changes to disk)
49
Scheduling Example 1
txn1: Begin; Read(A,x); x = x-100; Write(A,x); Read(C,y); y = y+100; Write(C,y); Commit
txn2: Begin; Read(A,v); v = v-100; Write(A,v); Read(B,w); w = w+100; Write(B,w); Commit
[Figure: a schedule of the actions of txn2 (on A and B) and txn1 (on A and C)]
50
Scheduling Example 2
txn1: Begin; Read(A,x); x = x-100; Write(A,x); Read(C,y); y = y+100; Write(C,y); Commit
txn2: Begin; Read(A,v); v = v-100; Write(A,v); Read(B,w); w = w+100; Write(B,w); Commit
[Figure: a second schedule of the actions of txn2 (on A and B) and txn1 (on A and C)]
51
Recall: Concurrency as Interleaving TXNs
• For our purposes, having TXNs occur concurrently means interleaving their component actions (R/W)
We call the particular order of interleaving a schedule
[Figure: a serial schedule and an interleaved schedule of T1 and T2, each with actions R(A), W(A), R(B), W(B)]
52
Recall: “Good” vs. “bad” schedules
We want to develop ways of discerning “good” vs. “bad” schedules
[Figure: a serial schedule and two interleaved schedules of T1 and T2, each with actions R(A), W(A), R(B), W(B); one interleaved schedule is marked X (“bad”)]
Why?
53
Ways of Defining “Good” vs. “Bad” Schedules
• Recall: we call a schedule serializable if it is equivalent to some serial schedule
- We used this as a notion of a “good” interleaved schedule, since a serializable schedule will maintain isolation & consistency
• Now, we’ll define a stricter, but very useful variant:
- Conflict serializability
We’ll need to define conflicts first..
54
Conflicts
Two actions conflict if they are part of different TXNs, involve the same variable, and at least one of them is a write
[Figure: an interleaved schedule of T1 and T2 with actions R(A), W(A), R(B), W(B); one W-R conflict and one W-W conflict are highlighted]
55
Conflicts
Two actions conflict if they are part of different TXNs, involve the same variable, and at least one of them is a write
[Figure: the same schedule of T1 and T2 with all “conflicts” highlighted]
56
Conflict Serializability
• Two schedules are conflict equivalent if:
- They involve the same actions of the same TXNs
- Every pair of conflicting actions of two TXNs are ordered in the same way
• Schedule S is conflict serializable if S is conflict equivalent to some serial schedule
Conflict serializable ⇒ serializable
So if we have conflict serializable, we have consistency & isolation!
57
Recall: “Good” vs. “bad” schedules
Conflict serializability also provides us with an operative notion of “good” vs. “bad” schedules!
[Figure: a serial schedule and two interleaved schedules of T1 and T2, each with actions R(A), W(A), R(B), W(B); one interleaved schedule is marked X (“bad”)]
Note that in the “bad” schedule, the order of conflicting actions is different than the above (or any) serial schedule!
58
Note: Conflicts vs. Anomalies
• Conflicts are things we talk about to help us characterize different schedules
- Present in both “good” and “bad” schedules
• Anomalies are instances where isolation and/or consistency is broken because of a “bad” schedule
- We often characterize different anomaly types by what types of conflicts predicated them
59
The Conflict Graph
• Let’s now consider looking at conflicts at the TXN level
• Consider a graph where the nodes are TXNs, and there is an edge from Ti → Tj if any actions in Ti precede and conflict with any actions in Tj
[Figure: a schedule of T1 and T2 with actions R(A), W(A), R(B), W(B), and its conflict graph over the nodes T1 and T2]
60
What can we say about “good” vs. “bad” conflict graphs?
[Figure: a serial schedule and two interleaved schedules of T1 and T2, each with actions R(A), W(A), R(B), W(B); one interleaved schedule is marked X (“bad”)]
A bit complicated…
61
What can we say about “good” vs. “bad” conflict graphs?
[Figure: conflict graphs over T1 and T2 for the serial schedule and the two interleaved schedules, one of which is marked X (“bad”)]
Theorem: Schedule is conflict serializable if and only if its conflict graph is acyclic
Simple!
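The theorem suggests a simple test, sketched here in Python: build the conflict graph and check it for cycles. The schedule encoding as (transaction, action, variable) triples and the two interleaved example schedules are assumptions of this sketch; the exact interleavings drawn on the slides are not reproduced.

```python
def conflict_graph(schedule):
    """Return the set of edges (Ti, Tj) for a list of (txn, action, variable)."""
    edges = set()
    for i, (t1, a1, v1) in enumerate(schedule):
        for t2, a2, v2 in schedule[i + 1:]:
            # Conflict: different TXNs, same variable, at least one write.
            if t1 != t2 and v1 == v2 and "W" in (a1, a2):
                edges.add((t1, t2))
    return edges


def is_conflict_serializable(schedule):
    """True iff the conflict graph is acyclic.

    Acyclicity is checked by repeatedly removing nodes that have no
    incoming edge from a remaining node.
    """
    edges = conflict_graph(schedule)
    nodes = {txn for txn, _, _ in schedule}
    while nodes:
        sources = {n for n in nodes
                   if not any(dst == n and src in nodes for src, dst in edges)}
        if not sources:
            return False  # every remaining node lies on or behind a cycle
        nodes -= sources
    return True


serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T1", "R", "B"), ("T1", "W", "B"),
          ("T2", "R", "A"), ("T2", "W", "A"), ("T2", "R", "B"), ("T2", "W", "B")]
good = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
        ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B")]
bad = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
       ("T2", "R", "B"), ("T2", "W", "B"), ("T1", "R", "B"), ("T1", "W", "B")]

for name, schedule in (("serial", serial), ("good", good), ("bad", bad)):
    print(name, sorted(conflict_graph(schedule)), is_conflict_serializable(schedule))
```

In the hypothetical "bad" schedule, T1 precedes T2 on A but T2 precedes T1 on B, which yields the edges T1 → T2 and T2 → T1 and hence a cycle.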
63
Running External Merge Sort on Larger Files
Assume we still only have 3 buffer pages (Buffer not pictured); M = 3
1. Split into files small enough to sort in buffer… and sort
Call each of these sorted files a run
2. Now merge pairs of (sorted) files… the resulting files will be sorted!
3. And repeat…
Call each of these steps a pass
4. And repeat!
[Figure: disk contents after each step, shown as pages of two values each: the unsorted files, the sorted runs, and the merged files after each pass until a single sorted file remains]
69
Simplified 3-page Buffer Version
Assume for simplicity that we split an N-page file into N single-page runs and sort these; then:
• First pass: Merge N/2 pairs of runs each of length 1 page
• Second pass: Merge N/4 pairs of runs each of length 2 pages
• In general, for N pages, we do $\lceil \log_2 N \rceil$ passes
- +1 for the initial split & sort
• Each pass involves reading in & writing out all the pages = 2N IO
[Figure: unsorted input file → split & sort → merge → merge → sorted!]
→ $2N \cdot (\lceil \log_2 N \rceil + 1)$ total IO cost!
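The pass count and the IO cost formula can be checked with a small in-memory sketch in Python. It only mimics the algorithm (lists stand in for disk pages, `heapq.merge` for the 3-page merge step), and the four example pages are made up for illustration.

```python
import heapq
import math


def external_sort_io(n_pages):
    """IO cost 2N * (ceil(log2 N) + 1) of the simplified 3-page-buffer version."""
    passes = math.ceil(math.log2(n_pages))
    return 2 * n_pages * (passes + 1)


def merge_sort_runs(pages):
    """Sort single-page runs by pairwise merging.

    Returns (sorted values, number of merge passes).
    """
    runs = [sorted(page) for page in pages]  # initial split & sort
    passes = 0
    while len(runs) > 1:
        # One pass: merge neighbouring pairs; a leftover run is carried over.
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes


pages = [[31, 33], [10, 12], [44, 55], [18, 43]]  # hypothetical 4-page file
sorted_values, merge_passes = merge_sort_runs(pages)
print(sorted_values, merge_passes, external_sort_io(len(pages)))
```

For N = 4 pages this gives 2 merge passes plus the initial split & sort, i.e. 3 rounds of reading and writing all 4 pages.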
70
Recap: High-level overview: indexes
id age salary other
006 19 50k ...
005 20 55k ...
004 25 50k ...
007 30 80k ...
002 35 75k ...
003 35 70k ...
001 40 65k ...
id age salary other
006 19 50k ...
004 25 50k ...
005 20 55k ...
001 40 65k ...
003 35 70k ...
002 35 75k ...
007 30 80k ...
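[Figure labels: first table (rows ordered by age): data file = index file, clustered (primary) index; second table (rows ordered by salary): index file, unclustered (secondary) index]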
71
NLJ: Order of tables matters
M = 102 pages, B(R) = 500, B(S) = 1000; ignoring output cost
R as outer relation (100 buffer pages for R, 1 for S, 1 for the output):
B(R) + B(R)/(M-2) × B(S) = 500 + (500/100) × 1,000 = 5,500
Cost R: 500; Cost S: 5,000 = 5 × 1,000; SUM: 5,500
S as outer relation (100 buffer pages for S, 1 for R, 1 for the output):
B(S) + B(S)/(M-2) × B(R) = 1,000 + (1,000/100) × 500 = 6,000
Cost S: 1,000; Cost R: 5,000 = 10 × 500; SUM: 6,000
Variant of Example 15.4 from "Cow book" (Ramakrishnan, Gehrke, Database Management Systems, 2003)
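The two cost computations can be reproduced with a short Python sketch. The function name `bnlj_cost` and the rounding up of the number of outer chunks (`math.ceil`) are assumptions of this sketch; in the slide's example the division is exact.

```python
import math


def bnlj_cost(b_outer, b_inner, m):
    """IO cost of a block nested loop join, ignoring output cost.

    The outer relation is read once in chunks of M-2 pages; the inner
    relation is scanned once per chunk.
    """
    chunks = math.ceil(b_outer / (m - 2))
    return b_outer + chunks * b_inner


cost_r_outer = bnlj_cost(b_outer=500, b_inner=1000, m=102)
cost_s_outer = bnlj_cost(b_outer=1000, b_inner=500, m=102)
print(cost_r_outer, cost_s_outer)
```

Putting the smaller relation on the outside saves one full read of 500 pages here.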
72
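RA Expressions Can Get Complex!
[Figure: relational algebra tree over the relations Person, Purchase, Person, Product, with selections σ name=fred and σ name=gizmo, projections Π pid, Π ssn, and Π name, and joins on seller-ssn=ssn, pid=pid, and buyer-ssn=ssn]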
74
SQL Means More than SQL
• SQL stands for the query language
• But commonly refers to the traditional RDBMS:
- Relational storage of data
• Each tuple is stored consecutively
- Joins as first-class citizens
• In fact, normal forms prefer joins to maintenance
- Strong guarantees on transaction management
• No consistency worries when many transactions operate simultaneously on common data
• Focus on scaling up
- That is, make a single machine do more, faster
75
Vertical vs. Horizontal Scaling
• Vertical scaling ("scale up", "scaling up"): you scale by adding more power (CPU, RAM)
• Horizontal scaling ("scale out", "scaling out"): you scale by adding more machines
76
Vertical vs. Horizontal partitioning
Source: http://www.piyushgupta.co.uk/2016/04/database-scaling-jargons.html, http://slideplayer.com/slide/12131436/70/images/17/SQL+Azure+Azure+Custom+Sharding.jpg
77
We Will Look at 4 Data Models
• Key/Value Store
• Column-Family Store
• Document Store
• Graph Databases
Source: Benny Kimelfeld
78
ACID May Be Overly Expensive
• In quite a few modern applications:
- ACID contrasts with key desiderata: high volume, high availability
- We can live with some errors, to some extent
- Or more accurately, we prefer to suffer errors than to be significantly less functional
• Can this point be made more “formal”?
79
CAP Service Properties
• Consistency:
- every read (to any node) gets a response that reflects the most recent version of the data
• More accurately, a transaction should behave as if it changes the entire state correctly in an instant; idea similar to serializability
• Availability:
- every request (to a living node) gets an answer: set succeeds, get returns a value (if you can talk to a node in the cluster, it can read and write data)
• Partition tolerance:
- service continues to function on network failures (cluster can survive
• As long as clients can reach servers
81
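Simple Illustration
[Figure: three scenarios with set(x, ...) and get(x) requests, arranged along the axes Consistency and Availability.
CA (Consistency, Availability): set(x,1) is propagated and answered with ok; get(x) returns 1. Our Relational Database world so far…
CP (Consistency, Partition tolerance): set(x,2) is answered with "wait...".
AP (Availability, Partition tolerance): set(x,2) is answered with ok, while get(x) still returns 1.]
In a system that may suffer partitions, you have to trade off consistency vs. availability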
83
The BASE Model
• Applies to distributed systems of type AP
• Basic Availability
- Provide high availability through distribution: There will be a response to any request. Response could be a ‘failure’ to obtain the requested data, or the data may be in an inconsistent or changing state.
• Soft state
- Inconsistency (stale answers) allowed: State of the system can change over time, so even during times without input, changes can happen due to ‘eventual consistency’
• Eventual consistency
- If updates stop, then after some time consistency will be achieved
• Achieved by protocols to propagate updates and verify correctness of propagation (gossip protocols)
• Philosophy: best effort, optimistic, staleness and approximation allowed
84
4 main data models
1. Key-value stores (e.g., Redis)
2. Column-family stores (e.g., Cassandra)
3. Document stores (e.g., MongoDB)
4. Graph databases (e.g., Neo4j)
85
Key-Value Stores
• Essentially, big distributed hash maps
• Origin attributed to Dynamo, Amazon’s DB for world-scale catalog / cart collections
- But Berkeley DB has been here for >20 years
• Store pairs ⟨key, opaque-value⟩
- Opaque means that DB does not associate any structure / semantics with the value; oblivious to values
- This may mean more work for the user: retrieving a large value and parsing to extract an item of interest
• Sharding via partitioning of the key space
- Hashing, gossip and remapping protocols for load balancing and fault tolerance
86
Example of Redis Commands
Commands | key | key maps to (value)
set x 10 | x | 10 (simple value)
hset h y 5 | h | y→5 (hash table)
hset h1 name two; hset h1 value 2 | h1 | name→two, value→2 (hash table)
hmset p:22 name Alice age 25 | p:22 | name→Alice, age→25 (hash table)
sadd s 20; sadd s Alice; sadd s Alice | s | {20, Alice} (set)
rpush l a; rpush l b; lpush l c | l | (c, a, b) (list)
Queries:
get x >> 10
hget h y >> 5
hkeys p:22 >> name, age
smembers s >> 20, Alice
scard s >> 2
llen l >> 3
lrange l 1 2 >> a, b
lindex l 2 >> b
lpop l >> c
rpop l >> b
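To see why the list ends up as (c, a, b) and why the queries return what they do, the commands can be mimicked with plain Python containers. This is not a Redis client: a `dict`, a `set`, and a `collections.deque` merely stand in for the corresponding Redis value types, and all values are kept as strings.

```python
from collections import deque

store = {}

store["x"] = "10"                                # set x 10
store["p:22"] = {"name": "Alice", "age": "25"}   # hmset p:22 name Alice age 25

store["s"] = set()
for member in ("20", "Alice", "Alice"):          # sadd s 20 / Alice / Alice
    store["s"].add(member)                       # duplicates are ignored

store["l"] = deque()
store["l"].append("a")                           # rpush l a  (push right)
store["l"].append("b")                           # rpush l b
store["l"].appendleft("c")                       # lpush l c  (push left)

list_snapshot = list(store["l"])                 # the list is now c, a, b
hkeys = list(store["p:22"])                      # hkeys p:22
scard = len(store["s"])                          # scard s
llen = len(store["l"])                           # llen l
lrange_1_2 = list_snapshot[1:3]                  # lrange l 1 2 (both ends inclusive)
lindex_2 = store["l"][2]                         # lindex l 2
lpop = store["l"].popleft()                      # lpop l
rpop = store["l"].pop()                          # rpop l

print(list_snapshot, hkeys, scard, llen, lrange_1_2, lindex_2, lpop, rpop)
```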
88
When to use it
• Use it:
- All access to the databases is via primary key
- Storing session information (web session)
- user or product profiles (single GET operation)
- shopping cart information (based on userid)
• Don't use it:
- relationships between different sets of data
- query by data (based on values)
- operations on multiple keys at a time
89
4 main data models
1. Key-value stores (e.g., Redis)
2. Column-family stores (e.g., Cassandra)
3. Document stores (e.g., MongoDB)
4. Graph databases (e.g., Neo4j)
90
2 Types of Column Stores
Standard RDB:
sid name address year faculty
861 Alice Haifa 2 NULL
753 Amir London NULL CS
955 Ahuva NULL 2 IE
Column store (still SQL): each column stored separately. Why? Efficiency (fetch only required columns), compression, sparse data for free
id sid: (1, 861), (2, 753), (3, 955)
id name: (1, Alice), (2, Amir), (3, Ahuva)
id address: (1, Haifa), (2, London)
id year: (1, 2), (3, 2)
id faculty: (2, CS), (3, IE)
Column-Family Store (NoSQL), Cassandra data model: a keyspace with two column families; timestamp (ts) for conflicts
Column family:
1 sid:861 name:Alice address:Haifa ts:20
2 sid:753 name:Amir address:London ts:22
3 sid:955 name:Ahuva ts:32
Column family:
1 year:2 ts:26
2 faculty:CS ts:25 email:{prime:c@d ext:c@e}
3 year:2 faculty:IE ts:32 email:{prime:a@b ext:a@c}
[Figure labels: “column” and “super column” mark individual entries of a row]
91
When to use it (e.g. Cassandra)
• Use it:
- Event logging (multiple applications can write in different columns and row-key: appname:timestamp)
- CMS: Store blog entries with tags, categories, links in different columns
- Counters: e.g. visitors of a page
• Don't use it:
- if you require ACID, consistency
- if you change query patterns often (in RDBMS schema changes are costly, in Cassandra query changes are: they require changing the column family design)
92
4 main data models
1. Key-value stores (e.g., Redis)
2. Column-family stores (e.g., Cassandra)
3. Document stores (e.g., MongoDB)
4. Graph databases (e.g., Neo4j)
93
Document Stores
• Similar in nature to key-value store, but value is tree structured as a document
• Motivation: avoid joins; ideally, all relevant joins already encapsulated in the document structure
• A document is an atomic object that cannot be split across servers
- But a document collection will be split
• Moreover, transaction atomicity is typically guaranteed within a single document
• Model generalizes column-family and key-value stores
94
Data Example: High-level
Document (~ record / row / tuple):
{
name: "Alice", age: 21, status: "A", groups: ["algorithms", "theory"]
}
Collection (~ table):
{
name: "Alice", age: 21, status: "A", groups: ["algorithms", "theory"]
}
{
name: "Bob", age: 18, status: "B", groups: ["database", "cooking"]
}
{
name: "Charly", age: 22, status: "A", groups: ["database", "cars"]
}
{
name: "Dorothee", age: 16, status: "A", groups: ["cars", "sports"]
}
Source: Modified from https://docs.mongodb.com/v3.0/core/crud-introduction/
95
Data Example
Collection inventory:
{
item: "ABC2",
details: {model: "14Q3", manufacturer: "M1 Corporation"},
stock: [{size: "M", qty: 50}],
category: "clothing"
}
{
item: "MNO2",
details: {model: "14Q3", manufacturer: "ABC Company"},
stock: [{size: "S", qty: 5}, {size: "M", qty: 5}, {size: "L", qty: 1}],
category: "clothing"
}
Document insertion:
db.inventory.insert(
{
item: "ABC1",
details: {model: "14Q3", manufacturer: "XYZ Company"},
stock: [{size: "S", qty: 25}, {size: "M", qty: 50}],
category: "clothing"
}
)
Source: Modified from https://docs.mongodb.com/v3.0/core/crud-introduction/
96
Example of a Simple Query
Collection orders:
{
_id: "a", cust_id: "abc123", status: "A", price: 25,
items: [{sku: "mmm", qty: 5, price: 3},
{sku: "nnn", qty: 5, price: 2}]
}
{
_id: "b", cust_id: "abc124", status: "B", price: 12,
items: [{sku: "nnn", qty: 2, price: 2},
{sku: "ppp", qty: 2, price: 4}]
}
Find all orders and price with status "A":
db.orders.find(
{status: "A"},
{cust_id: 1, price: 1, _id: 0}
)
The first argument is the selection, the second the projection.
In SQL it would look like this:
SELECT cust_id, price
FROM orders
WHERE status = "A"
Result:
{
cust_id: "abc123", price: 25
}
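What selection and projection do to the documents can be illustrated with plain Python dictionaries. The helper `find` below is a hypothetical stand-in written for this sketch, not the MongoDB API: it supports only equality selections and a list of fields to keep.

```python
orders = [
    {"_id": "a", "cust_id": "abc123", "status": "A", "price": 25,
     "items": [{"sku": "mmm", "qty": 5, "price": 3},
               {"sku": "nnn", "qty": 5, "price": 2}]},
    {"_id": "b", "cust_id": "abc124", "status": "B", "price": 12,
     "items": [{"sku": "nnn", "qty": 2, "price": 2},
               {"sku": "ppp", "qty": 2, "price": 4}]},
]


def find(collection, selection, projection):
    """Return projected copies of the documents matching `selection`.

    selection:  dict of field -> required value (equality only)
    projection: list of field names to keep in the result
    """
    matches = (doc for doc in collection
               if all(doc.get(field) == value
                      for field, value in selection.items()))
    return [{field: doc[field] for field in projection if field in doc}
            for doc in matches]


result = find(orders, {"status": "A"}, ["cust_id", "price"])
print(result)
```

The selection plays the role of the SQL WHERE clause and the projection the role of the SELECT list.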
97
When to use it
• Use it:
- Event logging: different types of events across an enterprise
- CMS: user comments, registration, profiles, web-facing documents
- E-commerce: flexible schema for products, evolve data models
• Don't use it:
- if you require atomic cross-document operations
- queries against varying aggregate structures
98
4 main data models
1. Key-value stores (e.g., Redis)
2. Column-family stores (e.g., Cassandra)
3. Document stores (e.g., MongoDB)
4. Graph databases (e.g., Neo4j)