27
Rutgers University CS541: Database Systems Spring 2008 Computer Science Department Rutgers University

CS541: Database Systems Spring 2008

  • Upload
    stacia

  • View
    37

  • Download
    0

Embed Size (px)

DESCRIPTION

CS541: Database Systems Spring 2008. Computer Science Department Rutgers University. Administration. Instructor: Amélie Marian [email protected] CoRE 324 (732) 445 6450 x0636 Office Hours: Mondays 3-4pm or by appointment TA: Minji Wu minji-wu @cs.rutgers.edu Office Hours: TBA. - PowerPoint PPT Presentation

Citation preview

Page 1: CS541: Database Systems Spring 2008

Rutgers University

CS541: Database SystemsSpring 2008

Computer Science DepartmentRutgers University

Page 2: CS541: Database Systems Spring 2008

Rutgers University

Administration

Instructor: Amélie [email protected] 324(732) 445 6450 x0636Office Hours: Mondays 3-4pm or by

appointment TA: Minji Wu

[email protected] Hours: TBA

Page 3: CS541: Database Systems Spring 2008

Rutgers University

Class Information

Web page:http://www.cs.rutgers.edu/~amelie/courses/541Spring2008.html

Meets Thursday 3:20-6:20pm in CoRE A

Prerequisites:CS513 and working knowledge of C or

Java or instructor’s permission

Page 4: CS541: Database Systems Spring 2008

Rutgers University

Grading (subject to small changes)

15% Homework (3-4) Due at beginning of class on due date

30% Programming Project In teams of two (same project) In three parts

Find data source and scenario Implementation of standard index structures for query

processing Extend project to non-standard query processing (e.g.,

IR-style text retrieval, nearest-neighbor, top-k) In class project presentation and demonstration More details later

25% Midterm Exam Tentatively scheduled for March 13

30% Final Exam

Page 5: CS541: Database Systems Spring 2008

Rutgers University

Collaboration Policy

Check DCS Academic Integrity Policy

Homework and exams are to be done individually

Project is done only with your team partner

Page 6: CS541: Database Systems Spring 2008

Rutgers University

Supporting Material

Textbook:Raghu Ramakrishnan, Johannes Gehrke:

Database Management Systems, 3rd edition, McGraw-Hill, 2002

Class website: Lecture Notes Research Papers (for advanced topics)

Page 7: CS541: Database Systems Spring 2008

Rutgers University

Communication

Please send me email, come to my office hour, or contact Minij if you have questions on the material, complaints, or feedback on how to improve the course

Page 8: CS541: Database Systems Spring 2008

Rutgers University

Class Organization

Basics of Database Systems What is a DBMS? Why do we need one? How do we design one? What are the common problems in DBMS?

Information Management Text documents Structure and content Approximate querying

Advanced Topics in Data Management What is new and exciting in DB Research? How do we deal with huge amounts of data? What are the new challenges brought by the

internet? How should DBMS evolve?

Page 9: CS541: Database Systems Spring 2008

Rutgers University

Short History of Data Management

Evolved from file systems (1960’s) Airline reservation systems Banking systems Corporate data

Relational DB systems (1970’s) Data organized in tables and relations that model real-

world Storage structure transparent to user High-level query language Widely used today

New challenges Distributed Data (e.g., internet) Parallel Computing Bigger systems Multimedia Data Data Analysis Information Integration

Page 10: CS541: Database Systems Spring 2008

Rutgers University

What is a DBMS?

Powerful tool to efficiently manage large amounts of data Persistent storage (more flexible than a

file system) Data manipulation (complex query

language) Transaction management

(simultaneous access to data)

Page 11: CS541: Database Systems Spring 2008

Rutgers University

Why Use a DBMS?

Data independence and efficient access.

Reduced application development time.

Data integrity and security. Uniform data administration. Concurrent access, recovery from

crashes.

Page 12: CS541: Database Systems Spring 2008

Rutgers University

Why Study Databases?

Shift from computation to information at the “low end”: user-input information (a

mess!) at the “high end”: scientific applications

Datasets increasing in diversity and volume. Digital libraries, interactive video, Human

Genome project, EOS project ... need for DBMS exploding

DBMS encompasses most of CS OS, languages, theory, AI, multimedia, logic

Page 13: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems: The ER Model

Conceptual design of database Models real-world:

Entities (Students, Professor, and Classes) Relationships (Amélie Marian teaches 541) Attributes are associated with entities (the

room for 541 is CoRE A) Constraints of the data

Logical schema of the data

Page 14: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems:The Relational Model

A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using the a given data model.

The relational model of data is the most widely used model today. Main concept: relation, basically a table with rows

and columns. Every relation has a schema, which describes the

columns, or fields. Two formal query languages

Relational algebra Relational calculus

Powerful and widely used query language: SQL

Page 15: CS541: Database Systems Spring 2008

Rutgers University

Levels of Abstraction

Many views, single conceptual (logical) schema and physical schema. Views describe how users

see the data.

Conceptual schema defines logical structure

Physical schema describes the files and indexes used.

Schemas are defined using DDL; data is modified/queried using DML.

Physical Schema

Conceptual Schema

View 1 View 2 View 3

Page 16: CS541: Database Systems Spring 2008

Rutgers University

Example: University Database

Conceptual schema: Students(sid: string, name: string, login: string,

age: integer, gpa:real) Courses(cid: string, cname:string, credits:integer) Enrolled(sid:string, cid:string, grade:string)

Physical schema: Relations stored as unordered files. Index on first column of Students.

External Schema (View): Course_info(cid:string,enrollment:integer)

Page 17: CS541: Database Systems Spring 2008

Rutgers University

Data Independence *

Applications insulated from how data is structured and stored.

Logical data independence: Protection from changes in logical structure of data.

Physical data independence: Protection from changes in physical structure of data.

One of the most important benefits of using a DBMS!

Page 18: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems:Physical Storage and Index Structures

Many alternatives exist, each ideal for some situations, and not so good in others: Heap (random order) files: Suitable when

typical access is a file scan retrieving all records.

Sorted Files: Best if records must be retrieved in some order, or only a `range’ of records is needed.

Indexes: Data structures to organize records via trees or hashing.

Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields

Updates are much faster than in sorted files.

Page 19: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems:Query Processing

What are the best algorithms to evaluate queries on data Performance issues: space/time

Algorithms for evaluating relational operators use some simple ideas extensively: Indexing: to retrieve small set of data Iteration: Sometimes, faster to scan all tuples even

if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)

Partitioning: By using sorting or hashing, we can partition the data and replace an expensive operation by similar operations on smaller inputs.

Ideally: Want to find best plan. Practically: Avoid worst plans!

Page 20: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems:Transaction Processing

Concurrent execution of user programs is essential for good DBMS performance. Because disk accesses are frequent, and

relatively slow, it is important to keep the cpu humming by working on several user programs concurrently.

Interleaving actions of different user programs can lead to inconsistency: e.g., check is cleared while account balance is being computed.

DBMS ensures such problems don’t arise: users can pretend they are using a single-user system.

Page 21: CS541: Database Systems Spring 2008

Rutgers University

Basics of Database Systems:Logical Data Management

Redundancy is at the root of several problems associated with relational schemas: redundant storage, insert/delete/update anomalies

Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements.

Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD).

Decomposition should be used judiciously: Is there reason to decompose a relation? What problems (if any) does the decomposition

cause?

Page 22: CS541: Database Systems Spring 2008

Rutgers University

Advanced Topics in Data Management:Information Retrieval and Web Search

Keyword search over text (unstructured) data

User Expectations: Many say “The first item shown should be what

I want to see!” This works if the user has the most

popular/common notion in mind, not otherwise. Widely used today Top-k query model

Page 23: CS541: Database Systems Spring 2008

Rutgers University

Advanced Topics in Data Management:Advanced Query Processing

New challenges: Proactive (and reactive) optimization Smart statistics collection to cope with

fast changes Approximate query answering Online query processing Important answers first (top-k queries,

skyline queries)

Page 24: CS541: Database Systems Spring 2008

Rutgers University

Advanced Topics in Data Management:XML and Web Data

No application interoperability in the web today: HTML not understood by applications

screen scraping brittle Database technology: client-server

still vendor specific New Universal Data Exchange Format:

XML XML = semi-structured data XML generated by applications XML consumed by applications Easy access: across platforms, organizations

Page 25: CS541: Database Systems Spring 2008

Rutgers University

Advanced Topics in Data Management:Data Mining

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Valid: The patterns hold in general.Novel: We did not know the pattern

beforehand.Useful: We can devise actions from the

patterns.Understandable: We can interpret and

comprehend the patterns.

Page 26: CS541: Database Systems Spring 2008

Rutgers University

Advanced Topics in Data Management:and more…

Distributed Databases Parallel Databases ORDBMS Data Cleaning Data Warehousing Data Streams …

Page 27: CS541: Database Systems Spring 2008

Rutgers University

Contact me for more information!

If you are interested in advanced DB topics…

For-credit research projects available Top-k query processing Scoring XML data Web data management