CompSci 516 Database Systems
Lecture 1: Introduction and Data Models
Instructor: Sudeepa Roy
Duke CS, Fall 2018, CompSci 516: Database Systems
Course Website
• http://www.cs.duke.edu/courses/fall18/compsci516/
• Please check frequently for updates!
Instructor
• Sudeepa Roy
– [email protected]
– https://users.cs.duke.edu/~sudeepa/
– Office hour: Thursdays 11 am-12 noon, LSRC D325
• About myself
– Assistant Professor in CS
– PhD: UPenn, Postdoc: Univ. of Washington
– Joined Duke CS in Fall 2015
– Research interests:
  • Data analysis, causality, database theory, applications of data, uncertain data, data provenance, crowdsourcing
Two TAs
• Tianpeng Chen
– [email protected]
• Yuchao Tao
– [email protected]
• Both CompSci 516 veterans!
– Office hours: TBD
Logistics
• Discussion forum: Piazza
– All students enrolled as of yesterday are already there
– Send me an email if you have not received a welcome email from Piazza
• To reach course staff:
– [email protected]
– Please use Piazza as much as possible
• Lecture slides will be uploaded before the class as notes
– but will be updated after the class
Grading
• Three homework assignments: 30%
• Project: 15%
• Midterm: 20%
• Final: 30%
• Class participation: 5%
Grading Strategy
• Relative grading
– The actual grade distribution at the end will depend on the performance of the entire class on all the components.
– The top student in the class gets an A+ irrespective of the numeric score, and only “above expectation” performances get an A+.
– No fixed lowest grade or grade distribution.
– Everyone can get a good grade by working hard!
Homework
• Due 2-3 weeks after it is posted or after the previous homework is due
– ALWAYS start early!
• No late days
– Contact the instructor if you have a *valid* reason to be late
– Another exam, project, or homework is NOT a valid reason; we will always be fair to all
– Computer crashes and medical issues (following official procedures) may count as valid reasons
– No guarantee that your request will be granted, so again, start early!
• To be done strictly individually
Homework Overview
• You will learn how to use traditional and new database systems in the homework
– You have to learn them mostly on your own, following tutorials available online and with some help from the TAs
• HW1: Traditional DBMS
– SQL and Postgres
• HW2: Distributed data processing
– Spark and AWS
• HW3: NoSQL
– MongoDB
Exams
• Midterm: Oct 11 (Thurs)
• Final: Dec 15 (Sat)
• In class
• Closed book, closed notes, no electronic devices
• Total weight: 20% + 30% = 50%
• Exams will test your understanding of the material
• Both exams are comprehensive
– They will include every lecture up to the exam
Projects
• 15% weight
• In groups of 3-4
– You can look for group members through Piazza by announcing your general area of interest or a problem you have in mind
– Each group member should do approximately equal work
• Show your creativity and researcher side!
• Work done should be at least equivalent to
– one homework × number of group members
• All group members will get the same grade
Project Topics
• Anything related to “Data”
– Data management / processing / cleaning
– Data visualization
– Data exploration or analysis
– Applications of data (to any field)
– Theoretical findings with data
– New tool for data analysis
• Choose a project according to your own interests
• You can check out major database conferences for ideas, e.g.
– Demonstrations (build a prototype solving a problem or improving UI)
  • SIGMOD’18: https://sigmod2018.org/sigmod_demo_list.shtml
  • SIGMOD’17: http://sigmod2017.org/sigmod-program/#posters
  • SIGMOD’16: http://sigmod2016.org/sigmod_demo_list.shtml
  • VLDB’18: http://vldb2018.lncc.br/accepted-demonstrations.html?demo-a
  • VLDB’17: http://www.vldb.org/2017/accepted_papers_demo_track.php
  • VLDB’16: http://vldb2016.persistent.com/demonstrations.php
– Research papers (solve a problem, do experiments with data)
  • Check out papers in SIGMOD and VLDB from recent years
• You can check out previous years of these conferences too, and other conferences from your own research area
Project Deliverables
1. Project proposal (due: 9/20 (Th), 1-3 pages)
– Problem selection is part of the project
– 3 weeks from now, but start ASAP: look for problems, study related work, find an interesting question, and let me know your initial thoughts, all by the deadline
2. Midterm progress report (due: 10/25 (Th), 3-5 pages)
3. Final project report (due: 11/27 (T), 4-8 pages)
4. A final 5-10 minute project presentation and/or demonstration (in the last 1-2 classes)
Project Evaluation Criteria
Approximate weights on a scale of 100:
1. Well-motivated? 10
2. Novel? 10
3. Comprehensive related work survey? 10
4. Quality of writing? 10
– should reflect all other factors too, except class presentation
5. Class presentation/demo? 15
– should reflect all other factors too, except writing
6. Technical contributions? 45
– Problem formulation / Algorithms / Experiments / Theory / System / User interface / Efficiency / Usability / Dataset exploration, etc.
Class Participation
• 5% weight
• Pop-up quizzes
– Participation (2.5%) + correct answers (2.5%)
– The lowest score will be dropped
• In general, actively participate in the class!
– Ask questions in class and on Piazza
– Stop me as many times as you need to understand the lectures
– Answer each other’s questions on Piazza
• Also send (anonymous or not) feedback, suggestions, or concerns on Piazza
– There is a “feedback” folder
Reading Material
• Will mostly follow the “cow book” by Ramakrishnan-Gehrke
– The chapter numbers will be posted
• You do not have to buy the books, but it will be good to consult them from time to time
• You should be prepared to do quite a bit of reading from various books and papers
What is this course about?
• This is a graduate-level database course in CS
• We will cover principles, internals, and applications of database systems in depth
• We will also have an introduction to a few advanced research topics in databases (later in the course)
A Quick Survey
• Have you taken an undergrad database course earlier?
– CS 316 / equivalent?
• Are you familiar with
– SQL?
– RA? (σ, Π, ×, ⨝, ρ, ∪, ∩, −)
– Keys, foreign keys?
– Indexes in databases?
– Logic: ∧, ∨, ∀, ∃, ¬, ∈, ⇒
– Transactions?
– Map-reduce / Spark?
• Have you ever worked with a dataset?
– relational database, text, csv, XML
• Have you ever used a database system?
– Postgres, MySQL, SQL Server, SQL Azure
What will be covered?
• Database concepts
– Data Models, SQL, Views, Constraints, RA, Normalization
• Principles and internals of database management systems (DBMS)
– Indexing, Query Execution-Algorithms-Optimization, Transactions, Parallel and Distributed Query Processing, MapReduce
• Advanced and research topics in databases
– e.g. Datalog, NoSQL, Data mining, Data warehouse
– More will be added in the “TBD” lectures
• We will go fast for some basic topics in databases covered in undergrad db courses
– Data model, SQL, RA
– But ask me to slow down if you are not familiar with them
What this course is NOT about
• Spark, AWS, cluster computing…
– Partially covered in a HW and a lecture
• Machine learning based analytics
• Statistical methods for data analytics
• Python, R, …
• Programming
Background
• You should have some understanding (at the CS undergraduate level) of
– data structures, discrete math, algorithms
– databases
– or be prepared to learn these yourself as necessary
• You need to pick up new coding frameworks and programming languages on your own
– and how to process data using them
– Homework assignments will mostly be self-taught
– …with help from the TAs
• Will involve some mathematical and analytical reasoning too
Why should we care about databases?
• We are in a data-driven world
• Data = Currency, Data = Power, Data = Fun
• “Big Data” is supposed to change the mode of operation for almost every single field
– Science, Technology, Healthcare, Business, Manufacturing, Journalism, Government, Education, …
• We must know how to collect, store, process, and analyze such data
Why should we care about databases? (Science)
• From the “Big Data” wiki: “The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. If all sensor data were recorded in LHC, …. this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.”
Why should we care about databases? (Technology)
• From the “Big Data” wiki:
– eBay.com uses two data warehouses at 7.5 PB and 40 PB (1 PB = 10^15 bytes) as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
– Facebook handles 50 billion photos from its user base
– As of August 2012, Google was handling roughly 100 billion searches per month
Why should we care about databases? (Healthcare, Media, Manufacturing, Sports, …)
• From the “Big Data” wiki:
– Healthcare: digitization of patients’ data, prescriptive analytics
– Media: tailor articles and advertisements that reach targeted people, validate claims
  • “Computational Journalism” project in the Duke DB group
– Manufacturing: supply planning
– Sports: improve training, understand competitors
Why should we care about databases?
• Democratization of Data!
• More data relevant to society is collected
– smartphones, cars, sensors, roads, social media, crime reports…
• Many people are interested in data, but not everyone knows how to analyze it
• Learning about data processing and database systems = a step forward
Why should we care about databases?
• Moore’s Law:
– Processing power doubles every 18 months
• But the amount of data doubles every 9 months!
– Disk sales (# of bits) double every 9 months
• Parkinson’s Law:
– Data expands to fill the space available for storage
• Moore’s Law is reversed: with data growing 4× and processing power 2× every 18 months, the time to process all the data doubles every 18 months
• But we have limited time in a day
• Need smarter data management and processing systems!
Slide acknowledgment: Prof. Jun Yang
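The arithmetic behind the two doubling rates can be checked with a short sketch. It assumes idealized exponential growth with exactly the doubling periods quoted on the slide (18 months for processing power, 9 months for data); the function name growth_factor and the chosen horizons are illustrative only.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months` for a quantity that doubles every period."""
    return 2 ** (months / doubling_period_months)


# Over one Moore's-Law period (18 months):
processing_18 = growth_factor(18, 18)   # processing power grows 2x
data_18 = growth_factor(18, 9)          # data grows 4x
time_ratio_18 = data_18 / processing_18  # time to process everything grows 2x

# Over three years the gap widens further:
processing_36 = growth_factor(36, 18)
data_36 = growth_factor(36, 9)
time_ratio_36 = data_36 / processing_36

print(f"18 months: data x{data_18:g}, processing x{processing_18:g}, time x{time_ratio_18:g}")
print(f"36 months: data x{data_36:g}, processing x{processing_36:g}, time x{time_ratio_36:g}")
```

Under these assumptions the time needed to scan all the data keeps growing even though the hardware gets faster, which is the motivation for smarter data management.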
Why should we care about databases?
• Simply storing large datasets in a flat file stops working at some point
– Need an efficient model, storage, and processing
• A DBMS takes care of common issues
– The user only has to run queries to process such datasets
– Much simpler than writing low-level code
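To make "run queries instead of writing low-level code" concrete, here is a minimal sketch that computes the same per-department salary total twice: once with a hand-written loop and once with a declarative SQL query. It uses Python's built-in sqlite3 module as a stand-in DBMS; the Employees table and its rows are made-up illustration data, not part of the course material.

```python
import sqlite3

# Hypothetical data: (name, department, salary)
rows = [("Alice", "Sales", 50000), ("Bob", "Sales", 60000), ("Carol", "R&D", 70000)]

# Hand-written approach: we decide how to scan and aggregate.
totals_by_hand = {}
for _name, dept, salary in rows:
    totals_by_hand[dept] = totals_by_hand.get(dept, 0) + salary

# DBMS approach: we state what we want; the system decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)", rows)
totals_by_query = dict(
    conn.execute("SELECT dept, SUM(salary) FROM Employees GROUP BY dept")
)
conn.close()

print(totals_by_hand)
print(totals_by_query)
```

With three rows the loop looks just as easy, but the query stays the same when the data no longer fits in memory or an index is added, whereas the loop would have to be rewritten.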
Today
• DBMS
• Data Models
• [RG] 1.1, 1.3-1.5
What is a Database? And what does it contain?
• A database is a collection of data
– typically related and describing activities of an organization
• A database may contain information about
– Entities
  • students, faculty, courses, classrooms
– Relationships between entities
  • students’ enrollment, faculty teaching courses, rooms for courses
Why use a DBMS?
• i.e. why not use a file system and a programming language?
• Suppose a company has a large collection of data on employees, departments, products, sales, etc.
• Requirements:
– Quickly answer questions on data
  • Note that all the data may not fit in main memory
– Concurrent access: apply changes consistently
– Restricted access (e.g. salary)
Why use a DBMS?
• A DBMS is a piece of software (i.e. a big program written by someone else) that makes these tasks easier
– Quick access
– Robust access
– Safe access
– Simpler access
• Next: some nice properties of a DBMS
Why use a DBMS?
1. Data Independence
– Application programs should not be exposed to the data representation and storage
– DBMS provides an abstract view of the data
2. Efficient Data Access
– A DBMS utilizes a variety of sophisticated techniques to store and retrieve data (from disk) efficiently
Why use a DBMS?
3. Data Integrity and Security
– DBMS enforces “integrity constraints”, e.g. check whether total salary is less than the budget
– DBMS enforces “access controls”, e.g. whether salary information can be accessed by a particular user
4. Data Administration
– Centralized professional data administration by experienced users can manage data access, organize data representation to minimize redundancy, and fine-tune the storage
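A minimal sketch of an integrity constraint being enforced by the system rather than by application code. It uses Python's built-in sqlite3 module; the Employees table, the 200000 cap, and the rows are assumptions for illustration. Note that this is a simpler per-row CHECK constraint, not the cross-row "total salary below budget" constraint mentioned on the slide, and that SQLite has no per-user access control, so that half of the slide is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each salary must lie within a fixed range.
conn.execute(
    """
    CREATE TABLE Employees (
        name   TEXT NOT NULL,
        salary INTEGER NOT NULL CHECK (salary >= 0 AND salary <= 200000)
    )
    """
)

conn.execute("INSERT INTO Employees VALUES ('Alice', 90000)")  # satisfies the constraint

try:
    conn.execute("INSERT INTO Employees VALUES ('Bob', 500000)")  # violates the CHECK
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
conn.close()

print("violating insert rejected:", rejected)
print("rows stored:", row_count)
```

The application never has to re-validate the rule: any program that writes to the table is held to the same constraint.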
Why use a DBMS?
5. Concurrent Access and Crash Recovery
– DBMS schedules concurrent accesses to the data such that the users think that the data is being accessed by only one user at a time
– DBMS protects data from system failures
6. Reduced Application Development Time
– Supports many functions that are common to a number of applications accessing data
– Provides a high-level interface
– Facilitates quick and robust application development
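One building block behind both properties is the transaction: a group of changes that is applied entirely or not at all. The sketch below shows only this all-or-nothing behavior on a single connection, using Python's built-in sqlite3 module; it does not demonstrate real concurrency or recovery after a crash. The Accounts table, the transfer helper, and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts (owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE owner = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE owner = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The debit would make A's balance negative, so the earlier credit to B is undone too.
ok_large = transfer(conn, "A", "B", 500)
after_failure = dict(conn.execute("SELECT owner, balance FROM Accounts"))

ok_small = transfer(conn, "A", "B", 40)
after_success = dict(conn.execute("SELECT owner, balance FROM Accounts"))
conn.close()

print(ok_large, after_failure)
print(ok_small, after_success)
```

The credit to B is deliberately executed first so that the rollback is visible: after the failed transfer, B's balance is back to its original value.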
When NOT to use a DBMS?
• A DBMS is optimized for certain kinds of workloads and manipulations
• There may be applications with tight real-time constraints or a few well-defined critical operations
• The abstract view of the data provided by a DBMS may not suffice
• To run complex statistical/ML analytics on large datasets
Data Model
• Applications need to model some real-world units
• Entities:
– Students, Departments, Courses, Faculty, Organization, Employee, …
• Relationships:
– Course enrollments by students, product sales by an organization
• A data model is a collection of high-level data description constructs that hide many low-level storage details
Data Model Can Specify:
1. Structure of the data
– like arrays or structs in a programming language
– but at a higher level (conceptual model)
2. Operations on the data
– unlike a programming language, not every operation can be performed
– allows limited sets of queries and modifications
– a strength, not a weakness!
3. Constraints on the data
– what the data can be
– e.g. a movie has exactly one title
Important Data Models
• Structured Data
• Semi-structured Data
• Unstructured Data
What are these?
Important Data Models
• Structured Data
– All elements have a fixed format
– Relational Model (table)
• Semi-structured Data
– Some structure but not fixed
– Hierarchically nested tagged elements in a tree structure
– XML
• Unstructured Data
– No structure
– text, image, audio, video
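A small sketch of what "some structure but not fixed" can look like in XML, parsed with Python's standard xml.etree.ElementTree module. The document below is made up for illustration: the two student elements share a tag but carry different sub-elements (the phone element is a hypothetical addition), which a fixed relational schema would not allow without NULLs or an extra table.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured document: elements are tagged and nested,
# but records of the same kind need not have the same fields.
doc = """
<students>
  <student sid="53666">
    <name>Jones</name>
    <login>jones@cs</login>
  </student>
  <student sid="53688">
    <name>Smith</name>
    <phone>555-0100</phone>
    <phone>555-0101</phone>
  </student>
</students>
"""

root = ET.fromstring(doc)

# Collect which child tags each student actually has.
fields_per_student = {
    student.get("sid"): sorted({child.tag for child in student})
    for student in root.findall("student")
}

for sid, fields in fields_per_student.items():
    print(sid, fields)
```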
Relational Data Model
• Proposed by Edgar F. (Ted) Codd in 1970
– won the Turing Award for it!
• Motivation:
– Simplicity
– Better logical and physical data independence
Relational Data Model
• The data description construct is a Relation
– Represented as a “table”
– Basically a “set” of records (set semantics)
– Order does not matter
– and all records are distinct
• However, this is true for the relational model, not for standard DBMSs
– They allow duplicate rows (bag semantics)
– unless restricted by key constraints. Why?

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Bag: {1, 1, 2, 2, 3, 2, 1, 5, 6, 1}
Set: {1, 2, 3, 5, 6}
Bag vs. Set
• Why “bag semantics” and not “set semantics” in standard DBMSs?
– Primarily performance reasons
– Duplicate elimination is expensive (requires sorting)
– Some operations like projections are much more efficient on bags than on sets
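The difference between bags and sets, and the effect of a key constraint, can be sketched as follows. The first part uses the bag and set from the slide; the SQL part uses Python's built-in sqlite3 module with a few rows from the Students example. The table names Students and Students_keyed are choices made for this sketch.

```python
import sqlite3
from collections import Counter

values = [1, 1, 2, 2, 3, 2, 1, 5, 6, 1]
bag = Counter(values)   # a bag remembers how many times each element occurs
as_set = set(values)    # a set keeps each element once

conn = sqlite3.connect(":memory:")

# No key declared: the DBMS stores exact duplicate rows (bag semantics).
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("53666", "Jones", 18), ("53688", "Smith", 18), ("53650", "Smith", 19)],
)

# Projection on name: duplicates stay unless we explicitly ask to remove them.
names_bag = [r[0] for r in conn.execute("SELECT name FROM Students ORDER BY name")]
names_set = [r[0] for r in conn.execute("SELECT DISTINCT name FROM Students ORDER BY name")]

# With a key constraint, a second row with the same sid is rejected.
conn.execute("CREATE TABLE Students_keyed (sid TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO Students_keyed VALUES ('53666', 'Jones')")
try:
    conn.execute("INSERT INTO Students_keyed VALUES ('53666', 'Jones')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
conn.close()

print(bag[1], sorted(as_set))
print(names_bag, names_set, duplicate_rejected)
```

The projection without DISTINCT is the cheap default; DISTINCT asks the system to do the extra duplicate-elimination work that the slide describes as expensive.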
Relational Data Model
• In the Students table:
– a column is an Attribute (also called Column or Field)
– a row is a Tuple (also called Row or Record)
– each cell holds a Value
• What is a poorly chosen attribute in this relation?
• Relational database = a set of relations
• A Relation: made up of two parts
1. Schema
2. Instance
Schema and Instance
• One schema can have multiple instances
• Schema:
– A template for describing an entity/relationship (e.g. students)
– Specifies the name of the relation + the name and type of each column, e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Instance:
– What we get when we fill in actual data values in a schema
– A table with rows and columns
– Each row/tuple follows the schema and domain constraints
– #Rows = cardinality, #fields = degree or arity
– Example below

Cardinality = 3, degree = 5
sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@ee     18   3.2
53650  Smith  smith1@math  19   3.8
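As a rough sketch, the schema corresponds to a CREATE TABLE statement and the instance to the rows currently stored. The example uses Python's built-in sqlite3 module and maps string/integer/real to SQLite's TEXT/INTEGER/REAL, which is an assumption of this sketch (SQLite is also lenient about enforcing column types).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: relation name plus the name and type of each column.
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Instance: the actual rows filled into that schema.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

cursor = conn.execute("SELECT * FROM Students")
degree = len(cursor.description)    # number of fields
cardinality = len(cursor.fetchall())  # number of rows
conn.close()

print("cardinality =", cardinality, "degree =", degree)
```

Inserting or deleting rows changes the instance (and its cardinality) while the schema, and hence the degree, stays the same.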
Levels of Abstractions in a DBMS
• Physical schema
– Storage as files, row vs. column store, indexes
– Will discuss these in later lectures
[Figure: layers of a DBMS: Disk, Physical Schema, Logical Schema, External Schemas (three shown)]
Levels of Abstractions in a DBMS
• Logical/Conceptual schema
– Describes the stored data in the physical schema
• Decided by conceptual schema design
– e.g. ER Diagram
  • not covered in this course
– Normalization
  • will be covered
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Levels of Abstractions in a DBMS
• External schema
– Different “views” of the database for different users
– Will discuss views later
• One physical and logical schema, but there can be multiple external schemas
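A minimal sketch of two external schemas defined over one logical schema, using SQL views through Python's built-in sqlite3 module. The view names Directory and High_gpa and the 3.3 threshold are hypothetical; SQLite also has no user accounts, so the sketch shows only the different views, not who is allowed to see which one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

# External schema 1: a public directory that hides age and gpa.
conn.execute("CREATE VIEW Directory AS SELECT sid, name, login FROM Students")
# External schema 2: only the students above a hypothetical gpa threshold.
conn.execute("CREATE VIEW High_gpa AS SELECT sid, gpa FROM Students WHERE gpa > 3.3")

directory_columns = [col[0] for col in conn.execute("SELECT * FROM Directory").description]
high_gpa = conn.execute("SELECT * FROM High_gpa ORDER BY sid").fetchall()
conn.close()

print(directory_columns)
print(high_gpa)
```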
Data Independence
• Application programs are insulated from changes in the way the data is structured and stored
• A very important property of a DBMS
• Two kinds: logical and physical
Logical Data Independence
• Users can be shielded from changes in the logical structure of data
• e.g. Students:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Divide into two relations:
Students_public(sid: string, name: string, login: string)
Students_private(sid: string, age: integer, gpa: real)
• A “view” Students can still be obtained from the new relations
– by “joining” them on sid
• A user who queries this view Students will get the same answer as before
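The Students example can be sketched end to end: the same query is run against the original single relation and against a view defined over the two split relations. The sketch uses Python's built-in sqlite3 module with two separate in-memory databases to stand for "before" and "after"; the sample query is an arbitrary choice.

```python
import sqlite3

rows = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
]

# Before: one relation Students.
old_db = sqlite3.connect(":memory:")
old_db.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
old_db.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", rows)

# After: the logical schema is split into two relations...
new_db = sqlite3.connect(":memory:")
new_db.execute("CREATE TABLE Students_public (sid TEXT, name TEXT, login TEXT)")
new_db.execute("CREATE TABLE Students_private (sid TEXT, age INTEGER, gpa REAL)")
new_db.executemany(
    "INSERT INTO Students_public VALUES (?, ?, ?)", [r[:3] for r in rows]
)
new_db.executemany(
    "INSERT INTO Students_private VALUES (?, ?, ?)", [(r[0], r[3], r[4]) for r in rows]
)
# ...and a view named Students recreates the old relation by joining on sid.
new_db.execute(
    """
    CREATE VIEW Students AS
    SELECT p.sid, p.name, p.login, q.age, q.gpa
    FROM Students_public p JOIN Students_private q ON p.sid = q.sid
    """
)

# The application's query is unchanged.
query = "SELECT name, gpa FROM Students WHERE age = 18 ORDER BY sid"
answer_before = old_db.execute(query).fetchall()
answer_after = new_db.execute(query).fetchall()
old_db.close()
new_db.close()

print(answer_before)
print(answer_after)
```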
Physical Data Independence
• The logical/conceptual schema insulates users from changes in physical storage details
– how the data is stored on disk
– the file structure
– the choice of indexes
• The application remains unaltered
– But the performance may be affected by such changes
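A minimal sketch of a physical change that leaves the application untouched: an index is added between two runs of the same query. It uses Python's built-in sqlite3 module; the index name idx_students_age is made up, and whether SQLite actually chooses to use the index on such a tiny table is up to its optimizer, so the sketch only compares the answers and prints the plans for inspection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

query = "SELECT sid, name FROM Students WHERE age = 18 ORDER BY sid"

answer_before = conn.execute(query).fetchall()
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Physical change only: the logical schema and the query text stay the same.
conn.execute("CREATE INDEX idx_students_age ON Students(age)")

answer_after = conn.execute(query).fetchall()
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
conn.close()

print(answer_before == answer_after)
print(plan_before)  # how the system planned to run the query without the index
print(plan_after)   # the plan once the index exists (it may or may not use it)
```

The answers are identical; only the execution strategy, and therefore the performance, is allowed to differ.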
Very important
Understand the Course Policy
See “what is allowed / not allowed”
You will be reminded in every homework assignment too