Foundations of Databases (Datenbanken I)
Prof. Dr. Torsten Grust
U Tübingen, Database Systems
Summer term 2009
Welcome to the completely rectangular world of ...
... Relational Database Management Systems (RDBMSs).
What this course is about:
• Convince you that there is more to database technology
than just open-file(), read()/write(), close-file().
• Make you see how versatile the strictly tabular data model
supported by relational databases can be.
• Make you best friends with SQL, the principal language
spoken by relational database systems.
• We will encounter a healthy mix of good, clean theory and
highly relevant CS practice—knowledge of RDBMSs and SQL
makes you sexy (in a sense).
Administrativa (1)
Lectures
Time slot                Room
Monday, 15:15–16:45      Sand 6/7, gr. Hörsaal (large lecture hall)
Tuesday, 08:15–09:45     Sand 6/7, gr. Hörsaal (large lecture hall)
Practice
Time slot                Room
Thursday, 10:15–11:45    Sand 6/7, kl. Hörsaal (small lecture hall)
Will this be a good fit for most of you? Please speak up.
Administrativa (2)
End-term Exam
• 90-minute examination on Monday, July 20th, 15:15.
• You may bring a double-sided, hand-written A4 cheat
sheet.
• Passing earns you 6 ECTS.
Assignments and Grading
• We will distribute, collect, and grade weekly
assignments.
• You may—and you should—work in teams of two.
• Scoring 2/3 of the overall points in the assignments
earns you an additional 2 ECTS.
Administrativa (3)
Course web home
http://www-db.informatik.uni-tuebingen.de/teaching/ss09/db1
• Download slides
(PDF—please bring a print-out and take notes)
• Download assignments, sample tabular data, code
snippets, . . .
• Please check now and then (“. . . assignment unsolvable
as given. . . ”, “. . . no lecture on . . . ”, etc.)
• Contact information
Just drop by our offices; send e-mail first if you
require specific help or longer attention.
These Slides
Examples
Definitions
Code snippets
Quizzes
• A specific slide set suitable for printing (lighter colors, . . . )
will be available on the course web home.
Read a Book, Write some SQL
Text book
(any introductory book is probably fine; ask me)
Alfons Kemper, André Eickler
Datenbanksysteme: Eine Einführung
Oldenbourg Verlag
6th or 7th ed.
Install IBM DB2 V9.5 Express-C
• DB2 Express-C: Full-featured, fast, freely available.
• We will bring it with us for almost any lecture.
• Download @ http://db2express.com/
ActiveRecord (Ruby on Rails)
• If time permits, we will close the course with an introduction
to and overview of ActiveRecord.
. ActiveRecord enables a truly seamless
embedding of database access and query
functionality into programming and scripting
languages (here: Ruby).
. You write Ruby fragments, the ActiveRecord framework
generates equivalent SQL commands for you.
. ActiveRecord is the glue between relational
databases and front-end web applications,
usually developed using Ruby on Rails.
Introduction
• After completing this chapter, you should be able to:
. explain basic notions: database state, schema, query,
update, data model, DDL, DML,
. explain the role of the DBMS,
. explain data independence, declarativity, and the three
schema architecture,
. name different classes of users of a database application
system,
. name some DBMS tools.
Introduction
Overview
1. Basic Database Notions
2. Database Management Systems (DBMS)
3. Programmer’s View, Data Independence
4. Database Users and Database Tools
Task of a Database (1)
• What is a database? Difficult question. There is no precise
and generally accepted definition.
• Naïve approach:
The main task of the database system (DBS) is to answer
certain questions about a subset of the real world, e.g.
Questioning a DBS
“Which homeworks has Ann Smith completed?” → Database System → 1, 2
Task of a Database (2)
• The DBS acts only as a storage for information. The
information must first be entered and then kept current.
Keeping a DBS current
“Ann Smith has done Homework 3 and received 10 points for it” → Database System → ok.
• A DBS is a computerized version of a card-index box/filing
cabinet (but more powerful and efficient).
Task of a Database (3)
• Normal database systems do not perform particularly
complicated computations on the stored data in order to
answer questions.
• However, a DBS can retrieve the requested data quickly from
a huge set of data (gigabytes, terabytes, ≫ main memory
size).
• A DBS can also aggregate/combine several pieces of stored
data to answer more complex questions (“Compute the
average points for Homework 3.”)
Task of a Database (4)
• Above, the question “Which homeworks has Ann Smith
completed? ” was shown in natural (English) language.
• Making machines understand natural language is a tough task
(and bears a large potential for misunderstandings).
• Therefore, questions (or queries) are normally written in a
formal language, these days typically SQL.
SQL
SQL ≡ Structured Query Language. Development began at IBM in
the 1970s; first standardized in 1986. These slides refer to SQL:2003.
Pronounced S–Q–L, or Sequel.
State, Query, Update
• The set of stored data is called the database state:
Query
Current State → SELECT HOMEWORK FROM SOLVED WHERE STUDENT = 'Ann Smith' → Answer
• Entering, modifying, or deleting information changes the
database state:
Update
Current State → INSERT INTO SOLVED VALUES ('Ann Smith', 3, 10) → New State
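The state/query/update cycle can be tried out with a minimal sketch. It assumes Python's standard-library sqlite3 module as a stand-in for the DBMS (the course itself uses DB2); the in-memory database, the helper name homeworks_of, and the sample rows are illustrative assumptions.

```python
import sqlite3

# In-memory SQLite database as a stand-in DBMS (assumption: not the course's DB2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.executemany(
    "INSERT INTO SOLVED VALUES (?, ?, ?)",
    [("Ann Smith", 1, 10), ("Ann Smith", 2, 8),
     ("Michael Jones", 1, 9), ("Michael Jones", 2, 9)],
)


def homeworks_of(student):
    """Query: reads the current database state without changing it."""
    rows = conn.execute(
        "SELECT HOMEWORK FROM SOLVED WHERE STUDENT = ? ORDER BY HOMEWORK",
        (student,),
    )
    return [homework for (homework,) in rows]


before = homeworks_of("Ann Smith")   # answer computed from the current state

# Update: moves the database from the current state to a new state.
conn.execute("INSERT INTO SOLVED VALUES ('Ann Smith', 3, 10)")

after = homeworks_of("Ann Smith")    # same query, evaluated on the new state
print(before, after)
```

The query text is identical both times; only the state it is evaluated against has changed.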
Structured Information (1)
• Each database can store only information of a predeclared
structure (a limited domain of discourse):
Structure mismatch
“Today’s special in the cafeteria is pizza.” → Homework DBS → Error.
• Because the data are structured, not simply text, complex
query formulations are possible, e.g. “How many homeworks
has each student done?”
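As a hedged illustration of why structure matters, the question “How many homeworks has each student done?” becomes a one-statement query once the data is stored in columns. The sketch below again assumes sqlite3 as a stand-in DBMS and made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.executemany(
    "INSERT INTO SOLVED VALUES (?, ?, ?)",
    [("Ann Smith", 1, 10), ("Ann Smith", 2, 8), ("Ann Smith", 3, 10),
     ("Michael Jones", 1, 9), ("Michael Jones", 2, 9)],
)

# Grouping works because STUDENT is a declared column, not a substring of free text.
counts = conn.execute(
    "SELECT STUDENT, COUNT(*) FROM SOLVED GROUP BY STUDENT ORDER BY STUDENT"
).fetchall()
print(counts)
```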
Structured Information (2)
• Actually, a database system stores only plain data (character
strings, numbers), and not information.
• Data becomes information by interpretation.
• Therefore, real–world concepts like students, homework,
cafeterias, etc., need to be defined/declared before the
database can be used.
A pure text database?
Which types of questions could we pose on a DBS
storing text (character strings) only with no further
structure provided?
State vs. Schema (1)
• Database Schema:
. Formal definition of the structure of the database
contents.
. Determines the possible database states.
. Defined only once (when the DB is created).
. In a programming language, this corresponds to variable
declaration (assigning a type to a variable).
Variable declaration
Example: variable declaration in C: short int i;
Possible states of variable i? -32768 ≤ i ≤ 32767 (for a 16-bit short)
State vs. Schema (2)
• Database State (Instance of the Schema):
. Contains the actual data, structured according to the
schema.
. Changes often
(whenever database information is updated).
. Corresponds to current contents/value of a programming
language variable.
Variable state change
In state s, variable i has value 41. Now perform
state change (s to s′) via assignment i = i + 1.
State vs. Schema (3)
• In the relational model, the data is structured in the form of
tables (relations).
• Each table has a name, a sequence of named columns
(attributes), and a set of rows (tuples).
A table
DB Schema (table name and column names):
SOLVED
STUDENT         HOMEWORK   POINTS
DB State (rows):
Ann Smith       1          10
Ann Smith       2          8
Michael Jones   1          9
Michael Jones   2          9
Data Model (1)
• Defines a formal language (syntax & semantics) for
. declaring database schemas
. querying the current database state
. changing the database state.
(Note: “data model” is, regrettably, also widely used to mean “database schema”.)
• Examples:
(Network Model, Hierarchical Model), Relational Model,
Entity Relationship Model, Object–Oriented Models,
UML, XML.
Introduction
Overview
1. Basic Database Notions
2. Database Management Systems (DBMS)
3. Programmer’s View, Data Independence
4. Database Users and Database Tools
DBMS (1)
• A Database Management System (DBMS) is an
application–independent software system that implements a
data model, i.e., allows for
. definition of a DB schema for some concrete application,
. storage of an instance of this schema on, e.g., a disk,
. querying the current instance (database state),
. changing the database state.
Application–independent vs. concrete application
Since a DBMS is application–independent, how can the DBMS
ensure that it interprets the stored application data correctly?
DBMS (2)
• Normal users do not need to use SQL for their daily tasks of
data entry or data lookup.
• These users use application programs that have been
developed specifically for this task and offer a more accessible
user interface.
• Internally, these application programs translate the user
requests into SQL statements (queries, updates) in order to
communicate with the DBMS.
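A minimal sketch of such an application program, assuming sqlite3 as the DBMS and two hypothetical routines (record_points, points_report) that would sit behind a data-entry form and a lookup screen. The user never sees the SQL; the routines generate it.

```python
import sqlite3


def record_points(conn, student, homework, points):
    """Hypothetical data-entry routine: turns a filled-in form into an SQL update."""
    conn.execute("INSERT INTO SOLVED VALUES (?, ?, ?)", (student, homework, points))
    conn.commit()


def points_report(conn, student):
    """Hypothetical lookup routine: turns a user request into an SQL query."""
    cursor = conn.execute(
        "SELECT HOMEWORK, POINTS FROM SOLVED WHERE STUDENT = ? ORDER BY HOMEWORK",
        (student,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
record_points(conn, "Ann Smith", 1, 10)
record_points(conn, "Ann Smith", 2, 8)
report = points_report(conn, "Ann Smith")
print(report)
```

Passing the values as parameters (the `?` placeholders) rather than pasting them into the SQL string keeps user input from being interpreted as SQL.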
DBMS (3)
• Often, several different application programs are used to
access the same centralized database.
• For example, the Homework DBS might provide:
. A read–only web interface for students.
. A program used by the TA (Hiwi) to load homework and
exam points.
. A program that prints a report for the professor used to
assign grades.
• The interactive SQL interface (SQL console) that comes
with the DBMS is simply yet another way to access the
DBMS.
DBMS (4)
User A ↔ Application Program ↔ Database Management System (DBMS)
User B ↔ DBMS Tool (e.g., SQL console) ↔ Database Management System (DBMS)
Database Management System (DBMS) ↔ DB Schema, DB State
DB Application Systems (1)
• Often, different users access the same database concurrently
(i.e., at the same time, touching the same data).
• The DBMS is usually implemented as a background server
process (or set of such processes) that is accessed over the
network by application programs (clients).
• One can also view the DBMS as an extension of the
operating system (a more powerful file system).
DB Application Systems (2)
Client–Server Architecture
Client: User A (Application)
Client: User B (SQL console)
↕ Network
Server: DBMS
DB Application Systems (3)
Three-Tier Architecture
Thin client: User A (Browser)
Thin client: User B (Browser)
↕
Application Server: Web Server, Application
↕
Server: DBMS
DB Application Systems (4)
• A recap of database vocabulary:
. A database (DB) consists of a DB schema and a DB state.
. A database management system (DBMS) is a software
system that implements a data model (e.g., a Relational
DBMS (RDBMS) implements the relational model).
. A database system (DBS) consists of a DBMS and a
database.
. A database application system consists of a DBS and a
set of application programs.
Introduction
Overview
1. Basic Database Notions
2. Database Management Systems (DBMS)
3. Programmer’s View, Data Independence
4. Database Users and Database Tools
Persistent Storage (1)
• Today:
5 → factorial → 120
• Tomorrow:
5 → factorial → 120
⇒ To evaluate factorial (n ↦ n!), no persistent storage is
necessary. The output is a function of the input only.
Persistent Storage (2)
• Today:
Ann → Homework points → 20
• Tomorrow:
Ann → Homework points → 30
⇒ The output is a function of the input and a persistent
state.
Persistent Storage (3)
A DBS provides persistent state
Input: Ann → Homework points → Output: 30
Homework points ↔ Persistent state
Persistent information
Information that lives longer than a single process (program
execution). Survives power outage and a reboot of the
operating system.
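The contrast between factorial and homework points can be sketched as follows. This assumes sqlite3 with a database file in a temporary directory; each function call opens and closes its own connection to mimic separate program executions. The file name and helper names are illustrative assumptions.

```python
import math
import os
import sqlite3
import tempfile

# Stateless: the output is a function of the input only, today and tomorrow.
factorial_today = math.factorial(5)


def add_points(db_path, student, homework, points):
    """One 'program execution' that changes the persistent state."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS SOLVED "
            "(STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
        )
        conn.execute("INSERT INTO SOLVED VALUES (?, ?, ?)", (student, homework, points))
        conn.commit()
    finally:
        conn.close()


def homework_points(db_path, student):
    """Another 'program execution': the output depends on input AND stored state."""
    conn = sqlite3.connect(db_path)
    try:
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(POINTS), 0) FROM SOLVED WHERE STUDENT = ?",
            (student,),
        ).fetchone()
        return total
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "homework.db")  # hypothetical file name
add_points(db_path, "Ann", 1, 10)
add_points(db_path, "Ann", 2, 10)
today = homework_points(db_path, "Ann")
add_points(db_path, "Ann", 3, 10)
tomorrow = homework_points(db_path, "Ann")
print(factorial_today, today, tomorrow)
```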
Persistent Storage (4)
Which of the following processes/devices need persistent
storage? If so, for which particular task?
① Web browser
② Pocket calculator
③ Mobile phone
④ Screen saver
⑤ DVD recorder
Typed Persistent Data (1)
• Classical way to implement persistence:
. Information needed in subsequent program invocations is
saved into a file.
. The operating system (OS) maintains the file on disk.
. Disks provide persistent memory: the contents are not lost if
the machine is switched off or the OS is rebooted.
OS files and persistence
The above statement is basically true but care should
be taken nevertheless. Why?
. File systems are predecessors of modern DBMSs.
Typed Persistent Data (2)
• Implementing persistence with files:
. OS files are usually nothing but sequences of bytes.
. A record structure must be defined on top of this (much
like in assembly languages):
Byte offset:  0 ... 39        40–41   42–43   (record length: 44)
Contents:     Ann Smith ...   03      10
. The record and file structure is contained only in the
programmers’ heads.
. The OS file system cannot prevent misinterpretation,
overflows, etc., because it is not aware of the file structure.
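To make the risks concrete, here is a small sketch that assumes the 44-byte layout suggested by the figure (40 bytes of name, 2 digits of homework number, 2 digits of points, all ASCII). The encode/decode helpers and the deliberately wrong 38-byte layout are illustrative assumptions.

```python
import struct

# Assumed layout: bytes 0-39 name, bytes 40-41 homework, bytes 42-43 points.
RECORD = struct.Struct("40s2s2s")


def encode(student, homework, points):
    """Pack one record; fields that do not fit are silently truncated by struct."""
    return RECORD.pack(
        student.encode("ascii").ljust(40),
        b"%02d" % homework,
        b"%02d" % points,
    )


def decode(record):
    """Unpack one record using the layout that lives 'in the programmer's head'."""
    name, homework, points = RECORD.unpack(record)
    return (name.decode("ascii").rstrip(),
            int(homework.decode("ascii")),
            int(points.decode("ascii")))


record = encode("Ann Smith", 3, 10)
correct = decode(record)

# Misinterpretation: a program assuming a 38-byte name field reads shifted fields.
WRONG = struct.Struct("38s2s2s2x")
misread = WRONG.unpack(record)        # the "points" field now holds the homework digits

# Overflow: 100 points do not fit into two digits and are cut to "10" without an error.
overflow = decode(encode("Ann Smith", 3, 100))
print(correct, misread[2], overflow)
```

Neither the misread nor the truncation raises an error, because the file layer only sees bytes.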
Typed Persistent Data (3)
• Implementing Persistence with a DBMS:
. The structure of the information to be stored must be
defined in a way the DBMS understands:
SQL DDL command
CREATE TABLE SOLVED (STUDENT VARCHAR(40),
HOMEWORK NUMERIC(2),
POINTS NUMERIC(2))
. The file structure is formally documented.
. The system can detect type errors in application programs.
. Simplified programming (higher abstraction level).
A Subprogram Library (1)
• Most DBMSs use OS files to store the data. (Some use raw
disk device access.)
• One can view a DBMS as a subprogram library that can be
used for file access.
• Compared with the direct OS system calls for file access, the
DBMS offers higher level operations.
• The DBMS offers a wide variety of algorithms that one
would otherwise have to program.
A Subprogram Library (2)
• For instance, a typical Relational DBMS contains routines for
. sorting (e.g., external merge sort),
. searching (e.g., B-trees),
. file space management, buffer management,
. aggregation, statistical evaluation.
• The algorithms are optimized for large data sets (that do not
fit into main memory).
• The DBMS also offers multi-user support (locking) and
safety measures to protect data against system crashes.
Data Independence (1)
• The DBMS is a layer of software above the OS files. The
files can be accessed only via the DBMS.
• The DBMS may change the file structure internally (reorder
records, split files, etc.) for performance reasons.
This goes unnoticed by the application program.
• Compare with the idea of abstract data types:
The implementation changes, the interface is kept stable.
Data Independence (2)
• Typical example:
. At the beginning, a professor used the homeworks DB only
for his courses in the current term.
. Since the DB was small and there were relatively few
accesses, it was sufficient to store the data as a heap file.
. Later, the entire university used the DB, and information of
previous courses had to be kept for some time.
. The DB size grows significantly, and the DB is accessed much more
frequently.
. An index file (e.g., a B-tree) is now needed to provide fast
access.
Data Independence (3)
• Without DBMS:
. Using the new B-tree index to access the file must be
explicitly built into the lookup (query) commands.
. Thus, application programs need to be changed if the mode
of file access is changed.
. If one forgets to change a seldom-used application
program, and this program does not update the index when
the data has been updated, the DB becomes inconsistent.
Data Independence (4)
• With Relational DBMS:
. Already at the interface, the system completely hides the
(non-)existence of indexes on files.
. Queries and updates do not have to and cannot refer to
indexes.
. The system automatically
① modifies the index in case of data updates,
② uses the index to evaluate queries against the indexed
data when advantageous.
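A hedged sketch of this behavior, assuming sqlite3 as the DBMS: the same query text is run before and after an index is created. EXPLAIN QUERY PLAN is SQLite-specific and its wording varies between versions, so it is used here only to peek at whether the index name shows up in the plan. The index name SOLVED_BY_STUDENT is an illustrative assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.executemany(
    "INSERT INTO SOLVED VALUES (?, ?, ?)",
    [("Ann Smith", 1, 10), ("Ann Smith", 2, 8),
     ("Michael Jones", 1, 9), ("Michael Jones", 2, 9)],
)

# The application's query: it never mentions an index.
QUERY = ("SELECT HOMEWORK, POINTS FROM SOLVED "
         "WHERE STUDENT = 'Ann Smith' ORDER BY HOMEWORK")


def plan(sql):
    """Return SQLite's textual execution plan (last column of each plan row)."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


result_without_index = conn.execute(QUERY).fetchall()
plan_without_index = plan(QUERY)

# Change of the internal schema only; the table definition is untouched.
conn.execute("CREATE INDEX SOLVED_BY_STUDENT ON SOLVED (STUDENT)")

result_with_index = conn.execute(QUERY).fetchall()
plan_with_index = plan(QUERY)

# Updates maintain the index automatically; the application does nothing extra.
conn.execute("INSERT INTO SOLVED VALUES ('Ann Smith', 3, 10)")
result_after_update = conn.execute(QUERY).fetchall()
print(plan_without_index, "|", plan_with_index)
```

The answers are identical with and without the index; only the plan chosen by the system differs.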
Data Independence (5)
• Conceptual Schema (“interface”):
. Only logical information content of the database, relevant
to the subset of the real world modelled in the DB.
. Simplified view of the DB: physical storage details hidden.
• Internal/Physical Schema (“implementation”):
. Indexes,
. Division of tables among disks,
. Storage management if tables grow or shrink,
. Placement of new rows in a table (sort order, clustering).
Data Independence (6)
① The user enters a query (e.g., in SQL) that only refers to
the conceptual schema.
② The DBMS translates this into a query/program (execution
plan) which refers to the internal schema.
This is done by the query optimizer.
③ The DBMS executes the translated query on the persistent
instance of the internal schema.
④ The DBMS translates the result back to the conceptual
level.
Back-translation?
Why would this be necessary and what would be
typical back-translation steps?
Data Independence (7)
Changing the internal schema
Before: Conceptual Schema ↔ Old Internal Schema (no B-tree index)
After: Same Conceptual Schema ↔ New Translation ↔ New Internal Schema (with B-tree index)
Declarative Languages (1)
• Physical data independence requires that the query language
(SQL) cannot refer to indexes.
• Declarative query languages go one step further:
. Queries should only describe what information is sought,
. but should not prescribe any particular method how to
compute/retrieve the desired information.
Kowalski
Algorithm = Logic + Control
Imperative/Procedural Languages: explicit control, implicit logic
Declarative/Descriptive Languages: implicit control, explicit logic
Declarative Languages (2)
• SQL is a declarative language. The user describes conditions
the requested data is required to fulfill:
SQL query
SELECT X.POINTS
FROM SOLVED X
WHERE X.STUDENT = 'Ann Smith'
AND X.HOMEWORK = 3
• Often, simpler formulations of the same query are possible,
since SQL users do not have to think about efficient execution.
• More concise than imperative programming: less expensive
program development and maintenance.
Declarative Languages (3)
• Declarative query languages
. allow powerful optimizers
(no evaluation method is prescribed)
. need powerful optimizers
(naïve evaluation is almost always too inefficient).
• Independence of current hardware technology and software
quality:
. Today’s queries will use tomorrow’s DBMS setup and
algorithms when a new version of the DBMS is released.
Logical Data Independence (1)
• Logical data independence allows for changes to the logical
information content of the database.
• Such changes are obviously restricted to additions to the
logical information content.
. Example: add column SUBMISSION_DATE to table SOLVED.
• Such additions may be required for new applications.
• It should not be necessary to change old applications only
because records now contain additional information.
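The column addition can be sketched as follows, assuming sqlite3 as the DBMS. The date value is made-up sample data, and the "old application" is represented by a single query string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.executemany(
    "INSERT INTO SOLVED (STUDENT, HOMEWORK, POINTS) VALUES (?, ?, ?)",
    [("Ann Smith", 1, 10), ("Ann Smith", 2, 8)],
)

# Query used by an existing ("old") application; it names the columns it needs.
OLD_QUERY = ("SELECT HOMEWORK, POINTS FROM SOLVED "
             "WHERE STUDENT = 'Ann Smith' ORDER BY HOMEWORK")
before = conn.execute(OLD_QUERY).fetchall()

# A new application needs more information: extend the logical schema.
conn.execute("ALTER TABLE SOLVED ADD COLUMN SUBMISSION_DATE DATE")
conn.execute(
    "INSERT INTO SOLVED (STUDENT, HOMEWORK, POINTS, SUBMISSION_DATE) "
    "VALUES ('Ann Smith', 3, 10, '2009-05-04')"   # made-up sample date
)

# The old application's query runs unchanged and simply ignores the new column.
after = conn.execute(OLD_QUERY).fetchall()
print(before, after)
```

This only works smoothly if old programs name their columns explicitly; statements such as `SELECT *` or an `INSERT` without a column list are sensitive to the added column.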
Logical Data Independence (2)
• Logical data independence is important when there are
application programs with distinct, but overlapping
information needs.
• Logical data independence also helps to integrate previously
distinct databases.
. In earlier times, every department of a company had its own
DB/data files.
. Now, businesses generally aim at one central DB.
Logical Data Independence (3)
• If a company uses more than one DB, the information in
these databases will normally overlap, i.e., some pieces of
information will be stored several times.
• Data is called redundant if it can be derived from other data
and knowledge internal to the application.
• Problems with redundancy:
. Duplicates data entry and update efforts.
. Sooner or later, data copies will get out-of-sync and thus
inconsistent.
. Wastes storage space, also on backup media.
Logical Data Independence (4)
• External Schemas/Views:
. Logical data independence requires a third level of database
schemas, the external schemas or views.
. Each user (department, . . . ) may have an individual view
of the data.
. An external view contains a subset of the information in
the database, maybe slightly restructured.
Views may also be vital for security reasons.
. In contrast, the conceptual schema describes the complete
information content of the database.
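External views can be sketched with SQL views, again assuming sqlite3 as the DBMS. The two view names (SUBMISSIONS, TOTALS) are illustrative assumptions: one hides the POINTS column, the other restructures the data into per-student totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.executemany(
    "INSERT INTO SOLVED VALUES (?, ?, ?)",
    [("Ann Smith", 1, 10), ("Ann Smith", 2, 8),
     ("Michael Jones", 1, 9), ("Michael Jones", 2, 9)],
)

# External view 1: a subset of the information (points are hidden).
conn.execute("CREATE VIEW SUBMISSIONS AS SELECT STUDENT, HOMEWORK FROM SOLVED")

# External view 2: slightly restructured information (one row per student).
conn.execute(
    "CREATE VIEW TOTALS AS "
    "SELECT STUDENT, SUM(POINTS) AS TOTAL FROM SOLVED GROUP BY STUDENT"
)

cursor = conn.execute("SELECT * FROM SUBMISSIONS")
submission_columns = [column[0] for column in cursor.description]
totals = conn.execute("SELECT STUDENT, TOTAL FROM TOTALS ORDER BY STUDENT").fetchall()
print(submission_columns, totals)
```

Users of a view query it like a table. In a multi-user DBMS, access rights could then be granted on the view instead of the underlying table (SQLite itself has no user accounts, so that step is not shown).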
Three–Schema Architecture
Three–Schema Architecture [ANSI/SPARC 1978]
User ↔ External Schema 1
...
User ↔ External Schema n
External Schemas 1 ... n ↔ Conceptual Schema
Conceptual Schema ↔ Internal Schema ↔ Stored data
More DBMS Functions (1)
• Transactions:
. Sequences of DB commands (queries and updates) are
executed as an atomic unit (“all or nothing”).
- The DBMS may crash while or after a sequence of commands
is executed. The DBMS then performs undo/redo.
. Support for backup and recovery.
. Support of concurrent users.
- Each user is given the illusion of being the only DB user at
any time. The DBMS performs locking and conflict detection.
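The “all or nothing” property can be illustrated with a small sketch, assuming sqlite3 and Python's connection context manager, which commits when the block succeeds and rolls back when it raises. The helper name and the simulated failure (a negative points value) are illustrative assumptions; a real crash would be handled by the DBMS's recovery component instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
conn.commit()


def enter_points_atomically(conn, rows):
    """Insert all rows as one transaction: either every row is stored or none."""
    with conn:  # commit on success, rollback if an exception escapes the block
        for student, homework, points in rows:
            if points < 0:
                raise ValueError("negative points")  # simulated failure mid-sequence
            conn.execute(
                "INSERT INTO SOLVED VALUES (?, ?, ?)", (student, homework, points)
            )


try:
    enter_points_atomically(conn, [("Ann Smith", 3, 10), ("Michael Jones", 3, -1)])
except ValueError:
    pass  # the first insert has been undone together with the failed one
count_after_failure = conn.execute("SELECT COUNT(*) FROM SOLVED").fetchone()[0]

enter_points_atomically(conn, [("Ann Smith", 3, 10), ("Michael Jones", 3, 9)])
count_after_success = conn.execute("SELECT COUNT(*) FROM SOLVED").fetchone()[0]
print(count_after_failure, count_after_success)
```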
More DBMS Functions (2)
• Security:
. Access rights: Who may perform which operations on
which table?
. Auditing: DBMS remembers who did what/when.
• Integrity:
. The DBMS checks that the entered data is
plausible/complete (such checks may also span several
tables).
. DBMS rejects updates (insertions and deletions) which
would violate defined business rules.
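Integrity checks can be sketched with declared constraints, assuming sqlite3 as the DBMS. The concrete rules (no missing values, at most one entry per student and homework, points between 0 and 10) are assumed business rules for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SOLVED (
        STUDENT  VARCHAR(40) NOT NULL,
        HOMEWORK NUMERIC(2)  NOT NULL,
        POINTS   NUMERIC(2)  NOT NULL CHECK (POINTS BETWEEN 0 AND 10),
        PRIMARY KEY (STUDENT, HOMEWORK)
    )
""")


def try_insert(row):
    """Attempt an insertion; the DBMS, not the application, decides if it is valid."""
    try:
        with conn:
            conn.execute("INSERT INTO SOLVED VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


outcomes = [
    try_insert(("Ann Smith", 3, 10)),    # plausible and complete
    try_insert(("Ann Smith", 3, 9)),     # same student/homework entered twice
    try_insert(("Ann Smith", 4, 99)),    # points outside the assumed range
    try_insert(("Ann Smith", 5, None)),  # incomplete: points missing
]
print(outcomes)
```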
More DBMS Functions (3)
• Data Dictionary:
. Metadata (“data about data”, schema, user list, access
rights) is available in system tables, e.g.:
System tables
SYS_TABLES
TABLE_NAME    OWNER
SOLVED        GRUST
SYS_TABLES    SYS
SYS_COLUMNS   SYS
SYS_COLUMNS
TABLE_NAME    SEQ   COL_NAME
SOLVED        1     STUDENT
SOLVED        2     HOMEWORK
SOLVED        3     POINTS
SYS_TABLES    1     TABLE_NAME
SYS_TABLES    2     OWNER
SYS_COLUMNS   1     TABLE_NAME
SYS_COLUMNS   2     SEQ
SYS_COLUMNS   3     COL_NAME
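Catalog names differ from system to system. As a hedged sketch, the following uses SQLite, whose catalog is the table sqlite_master and whose column metadata is available through PRAGMA table_info; these play roughly the role of the SYS_TABLES and SYS_COLUMNS tables shown above, but are not the same tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)

# Metadata is queried with ordinary SQL, just like application data.
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk); cid starts at 0.
columns = [
    (cid + 1, name)
    for cid, name, *_rest in conn.execute("PRAGMA table_info(SOLVED)")
]
print(tables, columns)
```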
Introduction
Overview
1. Basic Database Notions
2. Database Management Systems (DBMS)
3. Programmer’s View, Data Independence
4. Database Users and Database Tools
Database Users (1)
• Database Administrator (DBA):
. Should know about all schemas, may change the conceptual
and the internal schema (creates tables, creates/drops
indexes). Can damage everything.
. Gives access rights to users. Ensures security.
. Monitors system performance.
(Transaction throughput #TX/s, # concurrent users, index
sizes, . . . )
. Monitors available disk space and installs new disks.
. Ensures that backup copies of the data are made. Does
recovery after disk failures, etc.
Database Users (2)
• Application Programmer:
. Writes programs for standard, everyday tasks, to be used by
the naïve users (see below):
- safe data entry,
- report generation,
- data browsing.
. Knows SQL well, plus programming languages and
development tools.
. Usually supervised by DBA.
. Might do conceptual schema design (knows which tables the
application will need to access/create).
Database Users (3)
• Sophisticated User (one kind of “end user”):
. Knows SQL and/or some query tools, may use SQL console.
. Does non-standard aggregations/evaluations of the data
without help from application programmers.
. May generate complex queries.
• Naïve User (the other kind of “end user”):
. Uses DB only via application programs, often unaware of
existence of DBMS back-end.
. Primarily data entry user, simple browsing-style queries
against external views.
Database Tools
• Interactive SQL console
• Graphical/menu-based query tools
• Interface for DB access from standard programming
languages (C, C++, Java)
• Tools for form-based DB application (4GL)
• Report generators
• Web interface
• Tools for data import/export, backup & recovery,
performance monitoring, . . .
Summary (1)
• Functions of database systems:
. Persistence
. Integration/Redundancy Avoidance
. Physical and Logical Data Independence
. Subprogram Library: many algorithms built-in, especially
tuned for external memory access (disks)
. Query and Update evaluation
Summary (2)
• Functions of database systems, continued:
. High data safety and availability (Backup & Recovery)
. Combinations of operations into atomic transactions
. Multi-user support: synchronization of concurrent accesses
. Integrity Enforcement
. View management
. Security via data access control
. System catalog management (metadata)
Summary (3)
• The main goal of the DBMS is to give the user a simplified
view on the persistent storage, i.e., to hide any complications
introduced by the DBMS physical layer.
• The user does not worry about
. physical storage details
. different information needs of other users
. efficient query formulation
. possibility of system crashes/disk failures
. presence of concurrent users accessing identical data
subsets.
Exercise
Data-intensive programming
• Suppose homework points data is stored in a
line-structured text file with the format
Student Name:Homework Number:Points
e.g.,
Ann Smith:3:10
• Suppose you have to write a C program that prints the
total number of points per student (students sorted
alphabetically).
• How would you judge the programming effort (in terms
of lines and time)?
• In SQL, this takes 4 lines and approx. 1 1/2 minutes.
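For comparison, here is one possible sketch of both routes. It is written in Python rather than C (so the hand-written part is already shorter than a C program would be) and assumes sqlite3 as the DBMS; the sample lines and the helper name totals_by_hand are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up file contents in the format Student Name:Homework Number:Points
lines = [
    "Ann Smith:3:10",
    "Michael Jones:1:9",
    "Ann Smith:1:10",
    "Michael Jones:2:9",
    "Ann Smith:2:8",
]


def totals_by_hand(lines):
    """File route: parsing, grouping, summing, and sorting are all our job."""
    totals = defaultdict(int)
    for line in lines:
        student, _homework, points = line.rsplit(":", 2)
        totals[student] += int(points)
    return sorted(totals.items())


# Database route: load the same data into a table once ...
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SOLVED (STUDENT VARCHAR(40), HOMEWORK NUMERIC(2), POINTS NUMERIC(2))"
)
for line in lines:
    student, homework, points = line.rsplit(":", 2)
    conn.execute(
        "INSERT INTO SOLVED VALUES (?, ?, ?)", (student, int(homework), int(points))
    )

# ... and state the result declaratively in four lines of SQL.
totals_sql = conn.execute("""
    SELECT STUDENT, SUM(POINTS)
    FROM SOLVED
    GROUP BY STUDENT
    ORDER BY STUDENT
""").fetchall()

print(totals_by_hand(lines), totals_sql)
```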