CompSci 516 Database Systems
Lecture 1: Introduction and Data Models
Instructor: Sudeepa Roy
Duke CS, Fall 2017, CompSci 516: Database Systems
Course Website
• http://www.cs.duke.edu/courses/fall17/compsci516/
• Please check frequently for updates!
• New Room: LSRC D106
Instructor
• Sudeepa Roy
  – sudeepa@cs.duke.edu
  – https://users.cs.duke.edu/~sudeepa/
  – office hour: Mondays 11:30am-12:30pm, LSRC D325
• About myself
  – Assistant Professor in CS
  – PhD: UPenn, Postdoc: Univ. of Washington
  – Joined Duke CS in Fall 2015
  – Research interests:
    • Databases (theory and applications)
    • Data analysis, causality, explaining answers
    • Uncertain data, data provenance, crowdsourcing
Two (half-)TAs
• Yilin Gao
  – yilin.gao@duke.edu
  – office hour: Wed, 3-4pm, Location: TBD
• Keping Wang
  – keping.wang@duke.edu
  – office hour: Thurs, 3-4pm, Location: TBD
• Both CompSci 516 veterans!
Logistics
• Homework submission: Sakai
  – All enrolled students are already there
• Discussion forum: Piazza
  – All enrolled students are already there
  – Send me an email if you have not received a welcome email from Piazza
• Lecture slides will be uploaded before the class
  – but will be updated after the class
Grading
• Three homeworks: 30%
• Project: 15%
• Two midterms: 25 + 25 = 50%
• Class participation: 5%
Grading Strategy
• Relative grading
  – The actual grade distribution at the end will depend on the performance of the entire class on all the components.
  – The topper of the class gets an A+ irrespective of the number, and only “above expectation” performances get an A+.
  – No fixed lowest grade or grade distribution.
  – Everyone can get a good grade by working hard!
Homework
• Due 2-3 weeks after they are posted or the previous hw is due
  – ALWAYS start early!
• No late days
  – contact the instructor if you have a *valid* reason to be late
  – Another exam, project, or hw is NOT a valid reason – we will always be fair to all
  – Computer crash / sudden interview trips / medical issues (following official procedures) may count as valid reasons
  – No guarantee that your request will be granted – again, start early!
• To be done individually
Homework Overview
• You will learn how to use traditional and new database systems in the homework
  – Have to learn them mostly on your own, following tutorials available online and with some help from the TA
• HW1: Traditional DBMS
  – SQL and Postgres
• HW2: Distributed data processing
  – Spark and AWS
• HW3: NOSQL
  – e.g. MongoDB
Exams
• Midterm-1 – Oct 11 (Wed)
• Midterm-2 – Nov 29 (Wed)
• In class
• Closed book, closed notes, no electronic devices
• Total weight: 25% + 25% = 50%
• Exams will test your understanding of the material
• Both exams are comprehensive
  – would include every lecture up to the midterm
Projects
• 15% weight
• In groups of 3-4
  – You can look for group members through Piazza by announcing your general area of interest or if you have a problem in mind
  – Each group member should do approx. equal work
• Show your creativity and researcher-side!
• Work done should be at least equivalent to
  – a hw * no. of group members
• All group members will get the same grade
Project Topics
• Anything related to “Data”
  – Data management/processing/cleaning
  – Data visualization
  – Data exploration or analysis
  – Applications of data (to any field)
  – Theoretical findings with data
  – New tool for data analysis
• Choose a project according to your research interest
• You can check out major database conferences for ideas, e.g.
  – Demonstrations (build a prototype solving a problem or improving UI)
    • SIGMOD’17: http://sigmod2017.org/sigmod-program/#posters
    • SIGMOD’16: http://sigmod2016.org/sigmod_demo_list.shtml
    • VLDB’17: http://www.vldb.org/2017/accepted_papers_demo_track.php
    • VLDB’16: http://vldb2016.persistent.com/demonstrations.php
  – Research papers (solve a problem, do experiments with data)
    • Check out papers in SIGMOD and VLDB from recent years
  – You can check out previous years too, and conferences from your own research area
Project Deliverables
1. Project proposal (due: 9/20 (W), 1-3 pages)
  – problem selection is part of the project
  – 3 weeks from now
  – but start asap, look for problems, do related work study, find an interesting question, let me know your initial thoughts, all by the deadline
2. Midterm progress report (due: 10/25 (W), 3-5 pages)
3. Final project report (due: 11/30 (Th), 4-8 pages)
4. A final 5-10 mins project presentation and/or demonstration (in the last 1-2 classes)
Project Evaluation Criteria
Scale of 100:
1. Well-motivated? 10
2. Novel? 10
3. Comprehensive related work survey? 10
4. Quality of writing? 10
  – should reflect all other factors too except class presentation
5. Class presentation/demo? 15
  – should reflect all other factors too except writing
6. Technical contributions? 45
  – Problem formulation / Algorithms / Experiments / Theory / System / User interface / Efficiency / Usability / Dataset exploration etc.
Class Participation
• 5% weight
• Includes
  – Participation in class (Q/A)
  – Pop-up quiz (you will get a token by email to enroll in “gradiance”)
    • Participation + correct answering (lowest two scores will be dropped)
  – Evaluating others’ projects during the project presentation
In general,
• Actively participate in the class!
  – Ask questions in class and on Piazza
  – Stop me as many times as you need to understand the lectures
  – Answer each other’s questions on Piazza
• Also send (anonymous or not) feedback, suggestions, or concerns on Piazza – there is a “feedback” folder
Reading Material
• Will mostly follow the “cow book” by Ramakrishnan-Gehrke
  – The chapter numbers will be posted
• You do not have to buy the books, but it will be good to consult them from time to time
• You should be prepared to do quite a bit of reading from various books and papers
What is this course about?
• This is a graduate-level database course in CS
• We will cover principles, internals, and applications of database systems in depth
• We will also have an introduction to a few advanced research topics in databases (later in the course)
A Quick Survey
• Have you taken an undergrad database course earlier
  – CS 316 / equivalent?
• Are you familiar with
  – SQL?
  – RA? (σ, Π, ×, ⨝, ρ, ∪, ∩, −)
  – Keys, foreign keys?
  – Index in databases?
  – Logic: ∧, ∨, ∀, ∃, ¬, ∈, ⇒
  – Transactions?
  – Map-reduce / Spark?
• Have you ever worked with a dataset?
  – relational database, text, csv, XML
• Have you ever used a database system?
  – Postgres, MySQL, SQL Server, SQL Azure
What will be covered?
• Database concepts
  – Data Models, SQL, Views, Constraints, RA, Normalization
• Principles and internals of database management systems (DBMS)
  – Indexing, Query Execution-Algorithms-Optimization, Transactions, Parallel and Distributed Query Processing, MapReduce
• Advanced and research topics in databases
  – e.g. Datalog, NOSQL, Data mining, Data warehouse
  – More will be added in the “TBD” lectures
• We will go fast for some basic topics in databases covered in undergrad db courses
  – Data model, SQL, RA
  – But ask me to slow down if you are not familiar with them
What this course is NOT about
• Spark, AWS, cluster computing…
  – Partially covered in a HW and a lecture
• Machine learning based analytics
• Statistical methods for data analytics
• Python, R, …
• Programming
Background
• You should have some understanding (at the CS undergraduate level)
  – data structures, discrete maths, algorithms
  – databases
  – or have to learn these yourself as necessary
• Need to pick up new coding frameworks and programming languages on your own
  – and how to process data using them
  – Homework assignments will mostly be self-taught
  – …with help from the TA
• Will involve some mathematical and analytical reasoning too
Why should we care about databases?
• We are in a data-driven world
• “Big Data” is supposed to change the mode of operation for almost every single field
  – Science, Technology, Healthcare, Business, Manufacturing, Journalism, Government, Education, …
• We must know how to collect, store, process, and analyze such data
Why should we care about databases?
Science
• From “Big Data” wiki: “The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. If all sensor data were recorded in LHC, …. this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.”
Why should we care about databases?
Technology
• From “Big Data” wiki:
  – eBay.com uses two data warehouses at 7.5 PB (1 PB = 10^15 bytes) and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
  – Facebook handles 50 billion photos from its user base
  – As of August 2012, Google was handling roughly 100 billion searches per month
Why should we care about databases?
Healthcare, Media, Manufacturing, Sports, …
• From “Big Data” wiki:
  – Healthcare: digitization of patients’ data, prescriptive analytics
  – Media: tailor articles and advertisements that reach targeted people, validate claims
    • “Computational Journalism” project in Duke DB group
  – Manufacturing: supply planning
  – Sports: improve training, understanding competitors
Why should we care about databases?
• Simply storing such large datasets in a flat file stops working at some point
  – Need efficient model, storage, and processing
• A DBMS takes care of such issues
  – the user only has to run queries to process such datasets
  – much simpler than writing low-level code
Today
• DBMS
• Data Models
• [RG] 1.1, 1.3-1.5
What is a Database?
• A database is a collection of data
  – typically related and describing activities of an organization
And what does it contain?
• A database may contain information about
  – Entities
    • students, faculty, courses, classroom
  – Relationships between entities
    • students’ enrollment, faculty teaching courses, rooms for courses
Why use a DBMS?
• i.e. why not use a file system and a programming language?
• Suppose a company has a large collection of data on employees, departments, products, sales etc.
• Requirements:
  – Quickly answer questions on data
    • Note that all the data may not fit in main memory
  – Concurrent access: apply changes consistently
  – Restricted access (e.g. salary)
Why use a DBMS?
• A DBMS is a piece of software (i.e. a big program written by someone else) that makes these tasks easier
  – Quick access
  – Robust access
  – Safe access
  – Simpler access
• Next: some nice properties of a DBMS
Why use a DBMS?
1. Data Independence
  – Application programs should not be exposed to the data representation and storage
  – DBMS provides an abstract view of the data
2. Efficient Data Access
  – A DBMS utilizes a variety of sophisticated techniques to store and retrieve data (from disk) efficiently
Why use a DBMS?
3. Data Integrity and Security
  – DBMS enforces “integrity constraints” – e.g. check whether total salary is less than the budget
  – DBMS enforces “access controls” – whether salary information can be accessed by a particular user
4. Data Administration
  – Centralized professional data administration by experienced users can manage data access, organize data representation to minimize redundancy, and fine-tune the storage
Why use a DBMS?
5. Concurrent Access and Crash Recovery
  – DBMS schedules concurrent accesses to the data such that the users think that the data is being accessed by only one user at a time
  – DBMS protects data from system failures
6. Reduced Application Development Time
  – Supports many functions that are common to a number of applications accessing data
  – Provides high-level interface
  – Facilitates quick and robust application development
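The protection against failures in item 5 can be illustrated with a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in DBMS. The accounts table, the names, and the simulated failure are hypothetical and not part of the lecture. The point is only that a half-finished update is undone by the DBMS rather than by application code.

```python
import sqlite3


def transfer_with_failure():
    """Start a transfer, fail halfway, and return the balances afterwards."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()  # commit the initial state

    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'"
            )
            raise RuntimeError("simulated failure before crediting bob")
    except RuntimeError:
        pass  # the half-finished transfer has been rolled back

    rows = conn.execute(
        "SELECT name, balance FROM accounts ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


balances = transfer_with_failure()
print(balances)
```

The debit from the first account never becomes visible, so the data stays consistent even though the application stopped in the middle of its work.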
When NOT to use a DBMS?
• A DBMS is optimized for certain kinds of workloads and manipulations
• There may be applications with tight real-time constraints or a few well-defined critical operations
• The abstract view of the data provided by a DBMS may not suffice
• To run complex statistical/ML analytics on large datasets
Data Model
• Applications need to model some real world units
• Entities:
  – Students, Departments, Courses, Faculty, Organization, Employee, …
• Relationships:
  – Course enrollments by students, Product sales by an organization
• A data model is a collection of high-level data description constructs that hide many low-level storage details
Data Model Can Specify:
1. Structure of the data
  – like arrays or structs in a programming language
  – but at a higher level (conceptual model)
2. Operations on the data
  – unlike a programming language, not any operation can be performed
  – allow limited sets of queries and modifications
  – a strength, not a weakness!
3. Constraints on the data
  – what the data can be
  – e.g. a movie has exactly one title
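The three ingredients above (structure, operations, constraints) can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the movies table, its columns, and the helper try_insert_movie are hypothetical names, and "exactly one title" is approximated by a single NOT NULL title column.

```python
import sqlite3


def try_insert_movie(conn, movie_id, title):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO movies VALUES (?, ?)", (movie_id, title))
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
# Structure: every movie is a row with one id and one title.
# Constraint: NOT NULL means a movie cannot be stored without a title.
conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

accepted = try_insert_movie(conn, 1, "Casablanca")
rejected = try_insert_movie(conn, 2, None)       # missing title
duplicate_id = try_insert_movie(conn, 1, "Vertigo")  # id 1 is already used

# Operations: the data is accessed through declarative queries only.
titles = [row[0] for row in conn.execute("SELECT title FROM movies ORDER BY id")]
print(accepted, rejected, duplicate_id, titles)
```

Because the constraints live in the schema, every application that writes to the table is checked in the same way, without repeating the validation in application code.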
Important Data Models
• Structured Data
• Semi-structured Data
• Unstructured Data
What are these?
Important Data Models
• Structured Data
  – All elements have a fixed format
  – Relational Model (table)
• Semi-structured Data
  – Some structure but not fixed
  – Hierarchically nested tagged-elements in tree structure
  – XML
• Unstructured Data
  – No structure
  – text, image, audio, video
Relational Data Model
• Proposed by Edward (Ted) Codd in 1970
  – won the Turing Award for it!
• Motivation:
  – Simplicity
  – Better logical and physical data independence
Relational Data Model
• The data description construct is a Relation
  – Represented as a “table”
  – Basically a “set” of records (set semantics)
  – order does not matter
  – and all records are distinct
• However, this is true for the relational model, not for standard DBMSs
  – they allow duplicate rows (bag semantics)
  – unless restricted by key constraints. Why?

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Bag: {1, 1, 2, 2, 3, 2, 1, 5, 6, 1}
Set: {1, 2, 3, 5, 6}
Bag vs. Set
• Why “bag semantics” and not “set semantics” in standard DBMSs?
  – Primarily performance reasons
  – Duplicate elimination is expensive (requires sorting)
  – Some operations like “projection” are much more efficient on bags than on sets
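The difference between a bag and a set can be checked directly. The sketch below assumes Python's built-in sqlite3 module and collections.Counter; it uses the bag from the slides and a three-row, three-column subset of the Students table. The table and variable names are chosen for illustration.

```python
import sqlite3
from collections import Counter

bag = [1, 1, 2, 2, 3, 2, 1, 5, 6, 1]  # the bag from the slides
as_set = sorted(set(bag))             # duplicates eliminated
multiplicity = Counter(bag)           # how often each value occurs in the bag

conn = sqlite3.connect(":memory:")
# The key constraint on sid rules out duplicate rows in the stored table.
conn.execute("CREATE TABLE students (sid TEXT PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("53666", "Jones", 18), ("53688", "Smith", 18), ("53650", "Smith", 19)],
)

# A projection on name keeps duplicates by default (bag semantics) ...
names_bag = [r[0] for r in conn.execute("SELECT name FROM students ORDER BY name")]
# ... and only DISTINCT pays for duplicate elimination (set semantics).
names_set = [
    r[0] for r in conn.execute("SELECT DISTINCT name FROM students ORDER BY name")
]
print(as_set, multiplicity[1], names_bag, names_set)
```

The plain projection can stream rows out as they are read, while DISTINCT has to compare rows against each other, which is the extra cost the slide refers to.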
Relational Data Model

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

[Figure labels: Attribute/Column/Field, Tuple/Row/Record, Value]

What is a poorly chosen attribute in this relation?

• Relational database = a set of relations
• A Relation: made up of two parts
  1. Schema
  2. Instance
Schema and Instance
• One schema can have multiple instances
• Schema:
  – A template for describing an entity/relationship (e.g. students)
  – specifies name of relation + name and type of each column
    e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Instance:
  – When we fill in actual data values in a schema
  – a table, has rows and columns
  – each row/tuple follows the schema and domain constraints
  – #rows = cardinality, #fields = degree or arity
  – example below

Cardinality = 3, degree = 5
sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@ee     18   3.2
53650  Smith  smith1@math  19   3.8
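The schema/instance split can be reproduced in a few lines. This is a sketch assuming Python's built-in sqlite3 module; the mapping of string/integer/real to SQLite's TEXT/INTEGER/REAL types is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the relation name plus the name and type of each column.
conn.execute(
    "CREATE TABLE Students "
    "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Instance: actual rows that follow the schema.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

cursor = conn.execute("SELECT * FROM Students")
degree = len(cursor.description)      # number of fields (columns)
cardinality = len(cursor.fetchall())  # number of rows (tuples)
print(cardinality, degree)
```

Inserting or deleting rows changes the instance and its cardinality, while the schema and its degree stay the same.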
Levels of Abstractions in a DBMS
• Physical schema
  – Storage as files, row vs. column store, indexes
  – will discuss these in later lectures

[Figure: levels of abstraction, from bottom to top: Disk, Physical Schema, Logical Schema, three External Schemas]
Levels of Abstractions in a DBMS
• Logical/Conceptual schema
  – describes the stored data in the physical schema
• Decided by conceptual schema design
  – e.g. ER Diagram
    • not covered in this course
  – Normalization
    • will be covered

Students(sid: string, name: string, login: string, age: integer, gpa: real)
Levels of Abstractions in a DBMS
• External schema
  – different “views” of the database to different users
  – will discuss views later
• One physical and logical schema but there can be multiple external schemas
Data Independence
• Application programs are insulated from changes in the way the data is structured and stored
• A very important property of a DBMS
• Logical and Physical
Logical Data Independence
• Users can be shielded from changes in the logical structure of data
• e.g. Students:
  Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Divide into two relations
  Students_public(sid: string, name: string, login: string)
  Students_private(sid: string, age: integer, gpa: real)
• Still a “view” Students can be obtained using the above new relations
  – by “joining” them with sid
• A user who queries this view Students will get the same answer as before
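The split into Students_public and Students_private can be tried out as follows. This is a sketch assuming Python's built-in sqlite3 module; the helper query_students and the two in-memory databases named old and new are illustrative names, and the sample rows come from the Students table shown earlier.

```python
import sqlite3

rows = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
]


def query_students(conn):
    """The user's query; it does not know how Students is defined."""
    sql = "SELECT sid, name, login, age, gpa FROM Students ORDER BY sid"
    return conn.execute(sql).fetchall()


# Original logical schema: a single Students relation.
old = sqlite3.connect(":memory:")
old.execute(
    "CREATE TABLE Students "
    "(sid TEXT PRIMARY KEY, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
old.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", rows)

# New logical schema: two relations, plus a view that joins them on sid.
new = sqlite3.connect(":memory:")
new.execute("CREATE TABLE Students_public (sid TEXT PRIMARY KEY, name TEXT, login TEXT)")
new.execute("CREATE TABLE Students_private (sid TEXT PRIMARY KEY, age INTEGER, gpa REAL)")
new.executemany("INSERT INTO Students_public VALUES (?, ?, ?)", [r[:3] for r in rows])
new.executemany(
    "INSERT INTO Students_private VALUES (?, ?, ?)", [(r[0], r[3], r[4]) for r in rows]
)
new.execute(
    "CREATE VIEW Students AS "
    "SELECT p.sid, p.name, p.login, q.age, q.gpa "
    "FROM Students_public p JOIN Students_private q ON p.sid = q.sid"
)

before = query_students(old)
after = query_students(new)
print(before == after)
```

The same query text runs against both databases; only the definition of Students differs, which is what shields the user from the change.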
Physical Data Independence
• The logical/conceptual schema insulates users from changes in physical storage details
  – how the data is stored on disk
  – the file structure
  – the choice of indexes
• The application remains unaltered
  – But the performance may be affected by such changes
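A small experiment makes the point concrete. The sketch below assumes Python's built-in sqlite3 module; the index name idx_students_name and the helpers run and plan are hypothetical. Adding an index is a purely physical change, so the query text and its answer stay the same, and only the access plan reported by SQLite's EXPLAIN QUERY PLAN may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

QUERY = "SELECT sid FROM Students WHERE name = ? ORDER BY sid"


def run(connection):
    """Run the application's query; this code never changes."""
    return [row[0] for row in connection.execute(QUERY, ("Smith",))]


def plan(connection):
    """Return SQLite's description of how it would execute the query."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("Smith",))
    return " | ".join(str(row[-1]) for row in plan_rows)


before, plan_before = run(conn), plan(conn)

# Physical change only: add an index on name.
conn.execute("CREATE INDEX idx_students_name ON Students(name)")

after, plan_after = run(conn), plan(conn)
print(before == after)
print(plan_before)
print(plan_after)
```

On a table this small the speed difference is not measurable; the sketch only shows that the application's query and result are untouched while the DBMS is free to choose a different access path.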
Very important
Understand the Course-Policy
See “what is allowed / not allowed”
You will be reminded in every hw assignment too