Overview of Databases -...

Preview:

Citation preview

Overview of Databases

Gerome MiklauCMPSCI 645 – Database Design & Implementation

UMass AmherstFeb 1, 2006

Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom

Today

• Student information form• Overview of databases• Course topics• Course requirements

Databases & DBMS’s• A database is a large, integrated collection of

data.

• A database management system (DBMS) is a software package designed to store and manage databases, allowing:– Define the kind of data stored– Querying/updating interface– Reliable storage & recovery of 100s of GB– Control access to data from many concurrent users

Can filesystems do it?

• Schema for files is limited• No query language for data in files• Files can store large amounts of data, but

– no recovery from failure– no efficient access to items within file

• Concurrent access not safe

No

Evolution

• Early DBMS’s (1960’s), evolved from file systems.

• Data with many small items & many queries or modifications:– Airline reservations– Banking

Early DB systems

• Tree-based hierarchical data model• Graph-based network data model

• Encouraged users to think about data the way it was stored.

• No high level query language

Data model The data model includes basic assumptions about what’s

an “item” of data, how to represent it and interpret it.

The Relational Model•The relational data model (Codd, 1970):

– Data independence: details of physical storage are hidden from users

– High-level declarative query language• say what you want, not how to compute it. • mathematical foundation

– A theory of normalization guides the design of relations

Side-note: Turing Awards in Databases1973: Bachman, networked data model 1981: Codd, relational model1998: Jim Gray, transaction processing

DBMS Benefit #1: Generality and Declarativity

• The programmer or user does not need to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.

• Instead, the programmer/user programs with a logical model in mind

• The DBMS “makes it happen” based on an understanding of relative costs of different methods

Benefit #2: Efficiency and Scale

• Efficient storage of hundreds of GBs of data

• Efficient access to data

• Rapid processing of transactions

Benefit #3: Management of Concurrency and Reliability

• Simultaneous transactions handled safely.• Recovery of system data after system failure.

• More formally: the ACID properties– Atomicity - all or nothing– Consistency - sensible state not violated– Isolation - separated from effects– Durability - once completed, never lost

How Does One Build a Database?

• Start with a conceptual model• Design & implement schema• Write applications using DBMS and other

tools– Many ways of doing this (DBMS, API writers,

library authors, web server, etc.)– Common applications include PHP/JSP/servlet-

driven web sites• The DBMS takes care of query optimization

and execution

Conceptual Design

STUDENT COURSETakes

namesid cid name

PROFESSOR

Teaches

semester

fid name

Designing a Schema (Set of Relations)

• Convert to tables +constraints

• Then need to do “physical” design: the layout on disk, indices, etc.

sid name1 Jill2 Bo3 Maya

fid name1 Diao2 Saul8 Weems

sid cid1 6451 6833 635

cid name sem645 DB F05683 AI S05635 Arch F05

fid cid1 6452 6838 635

STUDENT Takes COURSE

PROFESSOR Teaches

Queries

• Find all courses that “Mary” takes

• What happens behind the scene ?– Query processor figures out how to answer

the query efficiently.

SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid

Queries, behind the scene

Query execution plan:

SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid

Declarative SQL query

Students Takes

sid=sid

sname

name=“Mary”

cid=cid

Courses

The optimizer chooses the best execution plan for a query

An Issue: 80% of the World’s Data is Not in a DB!

Examples: – Scientific data

(large images, complex programs that analyze the data) – Personal data– WWW and email

(some of it is stored in something resembling a DBMS)Data management is expanding to tackle these

problems

DBMSs in the Real WorldA huge industry for 20% of the world’s data!• Big, mature relational databases

– IBM DB2, Oracle, Microsoft SQL Server– Adding advanced features, including “native XML” support

• “Middleware” above these systems– SAP, Siebel, PeopleSoft, dozens of special-purpose apps

• Integration and warehousing systems– BEA AquaLogic, DB2 Information Integrator

• Current trends:– Web services; XML everywhere– Smarter, self-tuning systems

Database Research

• One of the broadest, most exciting areas in CS!• A microcosm of CS in general

• languages, operating systems, concurrent programming, data structures, algorithms, theory, distributed systems, statistical techniques.

• Theory and systems well-integrated.

Recent Trends in Databases

• XML– Relational databases with XML support– Middleware between XML and relational databases– Large-scale XML message systems

• Main memory database systems• Peer data management• Stream data management• Model management, provenance• Security and privacy• Modeling uncertainty, probabilistic databases

What is the Field of Databases ?

• To an applied researcher (SIGMOD/VLDB/ICDE)– Query optimization– Query processing (yet-another join algorithm)– Transaction processing, recovery (but most stuff is already

done)– Novel applications: data mining, high-dimensional search

• To a theoretical researcher (PODS/ICDT/LICS)– Focus on the query languages– Query language = logic = complexity classes

Course topics

• Fundamentals: relational design, query languages.

• Database internals: storage, indexing, query processing, query optimization, transaction management.

• Theory: expressiveness of query languages, static analysis, complexity.

• XML and semi-structured data models.• Security: access control, privacy.• Advanced topics: streaming data,

information integration.

Prerequisites

• Official: undergrad course in DB or OS• Also:

– Elementary complexity theory

Grading

• Homework: 20%• Paper reviews: 10%• Project: 25%• Midterm: 20%• Final: 25%

Homework: 20%

• 4 assignments throughout the course– written problem sets– practical experience with SQL, XQuery

Paper Reviews: 10%

• Approximately 5 classic papers will be assigned

• Short written reviews are due before the day of class. Email to: – cs645-reviews@cs.umass.edu

First paper review:Read Sec 1 of Codd’s paper Due Wed Feb 8th

Project: 25%• General theme: apply database principles to a new

problem• Suggested topics will be discussed next Monday• Groups of 2 preferred. 3 possible.• Project work will include:

– Reading some of the research literature– Implementation– Written report– In-class presentation

• Periodic consultation with one of the instructors

Exams

• Midterm (20%)– in-class, Monday, Apr 3

• Final (25%)– Thursday, May 25, 10:30am

Textbook

Database Management Systems Ramakrishnan and Gehrke

Other useful resources• Database systems: the complete book (Ullman,

Widom and Garcia-Molina)• Readings in Database Systems (Stonebraker and

Hellerstein)• Foundations of Databases (Abiteboul, Hull, Vianu)• Data on the Web (Abiteboul, Buneman, Suciu)• Parallel and Distributed DBMS (Ozsu and Valduriez)• Transaction Processing (Gray and Reuter)• Data and Knowledge based Systems (volumes I, II)

(Ullman)• Proceedings of SIGMOD, VLDB, PODS conferences.

Communication

• Instructors– Office hours: (see website)– Email: cs645-instructor@edlab-mail.cs.umass.edu

• Check the course webpage often• Course bulletin board, mailing list

31

Questions about the course?

32

Recommended