Archetypal Databases, or, What Is a DBMS? Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 9, 2008

Archetypal Databases,or, What Is a DBMS?

Zachary G. IvesUniversity of Pennsylvania

CIS 650 – Implementing Data Management Systems

September 9, 2008

Today Last time we saw the “vision” of the relational

database – decoupling “what” from “how” Storage scheme shouldn’t matter to the programmer … Only the operations to be performed

But how do we realize this vision? And what about all of the other aspects of a

DBMS? Concurrency, recovery Creating tables, performing updates, …

2

3

The INGRES System Interactive Graphics and Retrieval System

Probably the first usable RDBMS One of the first real projects built on UNIX

On a minicomputer, the PDP-11, which greatly constrained things due to limited memory

No low-level access to disk – prevented clustering in storage Based on a relational query language called QUEL

Many religious wars between QUEL and SEQUEL camps Stonebraker’s first commercialized project

Changed university IP rules forever

Today: was recently open-sourced by Computer Associates … but mostly overshadowed by MySQL and Postgres (a later

Stonebraker project)

4

Processes Precompiler converts C+EQUEL into a

C program that gets compiled Due to 64K process size limits,

needed to break things into 4 processes communicating via pipes

Even some of these needed to be broken into overlays

Later added a Process 2.5 to make 5 processes

C program

Lexer, parserconcurrencyquery mod.

Query processing

Utilities,recovery

2

3

4

5

System-R Probably a bit

closer to today’s DBMSs, at least at the low level – but didn’t run on UNIX

Based on years of experience with IMS and other IBM database systems At this point, prior to Selinger (note limited

optimization), Lindsay (not a huge performance focus)

Components were built to be generally reusable Default language was called SEQUEL, by Chamberlin

and Boyce

6

System-R Architecture

Relational StorageSystem (RSS)

Storage, concurrency, access paths (“images”, “links”),

triggers

RelationalData System (RDS)

parsing, (limited) optimization

SEQUEL or e-SEQUEL(RDI)

(RSI)

7

Languages QUEL vs. SEQUEL

The focus of many religious wars, though they borrowed each other’s ideas

Ultimately IBM won due to (1) market presence, (2) Oracle SEQUEL had NULL concept, bag semantics, aggregation

Postgres was originally based on QUEL, hence PostgreSQL

QUEL: more orthogonal,simple EQUEL embedding

RANGE OF X IS MYRELRETRIEVE (A.B)WHERE … ## C-block

SEQUEL: more block-oriented,embedded via cursors and row sets, aggregationSELECT A.BFROM MYREL XWHERE …

8

Administration of a DatabaseINGRES:

Only the DBA can create shared relations and grant access to them

It’s possible for a user to create temporary relations (and required for query processing)

System-R: Anyone can create private relations and grant access to

them It’s possible to create persistent or temporary relations Mechanisms to add columns to tables (by default these

become NULL)

9

Integrity and Security Both systems allowed much more general

notions of integrity than key constraints Assertions as a means of validation!

Query modification is one method used to do this Conjunction of assertions plus the query/update Only limited expressiveness – in INGRES there must be

at most one variable Security models:

System-R used a view-based security model INGRES used query modification How do these differ? Which is better?

Data Storage

11

Physical Data Layout Tuples must fit within a

page All tuples have TIDs

TID is a page + index System-R

Allowed links between tuples

Built-in concept of NULL Where else are TIDs

useful?

Page-level clustering: done by System-R in extents

What about INGRES?

t1t2 t3

12

Access Paths – INGRES INGRES Access Methods Interface (AMI):

Can have unordered “heap file” Hash-based indexing ISAM-like indexing on a primary key

Predates B+ Tree – initially height-balanced, but afterwards index structure is static; requires overflow pages

“ISAM-like” because can’t sort across pages – can only lock one page at a time!!!

Lookups are done via OPENR(), GET() FIND() can be used to find a start or stop point Can call PARAMD, PARAMI to get parameters of data or index

Idea of extensible access methods – later revisited in Generalized Search Trees (GiST)

13

Access Paths – System-R “Images”

Slightly more than indices (which are included there) – also linked lists and orderings

Index structures include ISAM and B Trees Search arguments, aka sargable predicates

Note that clustering was a key concept here Links via TIDs

Note that these were NOT relational in spirit! Intention was to support IMS over RSS Some code likely made its way back into IMS

How do these concepts relate to OO databases and today’s XML databases?

14

Catalogs and Indices as Logical Abstractions System-R and INGRES decided to express

catalogs as relations, making them accessible to queries as well as enabling reuse of code How are they accessible in today’s RDBMSs?

INGRES even did this with indices – 1:1 mapping between files and relations Indices vs. views: how are they different? How are they different in the presence of TIDs or

links?

Query Processing

16

Query Processing - INGRES DECOMP algorithm and One Variable Query Processor

Break every query into separate operations that generate a temp relation for every projection, selection, join

Pick a relation to iterate over, substitute value for variable Repeat recursively

Note that (1) it’s interpreted, (2) it’s non-pipelined, and (3) it’s adaptive, always choosing the smallest relation for the substitution

Note that access paths were chosen in a pretty ad hoc way Also, no concept of sargable predicates – instead, could FIND

both upper and lower bounds, but needed to GET between

17

DECOMP Example RANGE OF E, M IS EMPLOYEE

RANGE OF D IS DEPTRETRIEVE (E.NAME)WHERE E.SALARY > M.SALARY AND E.MANAGER = M.NAME AND E.DEPT = D.DEPT AND D.FLOOR# = 1 AND E.AGE > 40

First: apply selection and projection

18

DECOMP, Ctd. RANGE OF D IS DEPT

RETRIEVE INTO T1 (D.DEPT)WHERE D.FLOOR# = 1

RANGE OF E IS EMPLOYEERETRIEVE INTO T2(E.NAME, E.SALARY, E.MANAGER, E.DEPT)WHERE E.AGE > 40

Now substitute these back into the main query

19

DECOMP, Ctd. RANGE OF E IS T2

RANGE OF M IS EMPLOYEERANGE OF D IS T1RETRIEVE (E.NAME)WHERE E.SALARY > M.SALARY AND E.MANAGER = M.NAME AND E.DEPT = D.DEPT

Now pick the relation with smallest cardinality, e.g., T1, and substitute

20

One-Variable Substitution foreach D in T1, recursively process:

RANGE OF E IS T2RANGE OF M IS EMPLOYEERETRIEVE (E.NAME)WHERE E.SALARY > M.SALARY AND E.MANAGER = M.NAME AND E.DEPT = *valueOf(D.DEPT)*

Now apply the selection to E…RANGE OF E IS T2RETRIEVE INTO T3 (E.NAME, E.SALARY, E.MANAGER, E.DEPT)WHERE E.DEPT = *valueOf(D.DEPT)*

21

DECOMP, Recursively So the query looks like:

foreach D in T1 RANGE OF E IS T2

RETRIEVE INTO T3 (E.NAME, E.SALARY, E.MANAGER, E.DEPT)WHERE E.DEPT = *valueOf(D.DEPT)*

RANGE OF E IS T3RANGE OF M IS EMPLOYEERETRIEVE (E.NAME)WHERE E.SALARY > M.SALARY AND E.MANAGER = M.NAME

Now choose the smallest relation for substitution (e.g., T3)

22

DECOMP, Recursively foreach D in T1

foreach D in T1 RANGE OF E IS T2

RETRIEVE INTO T3 (E.NAME, E.SALARY, E.MANAGER, E.DEPT)WHERE E.DEPT = *valueOf(D.DEPT)*

foreach E2 in T3 RANGE OF M IS EMPLOYEE

RETRIEVE INTO (val1)WHERE *valueOf(E2.SALARY)* > M.SALARY AND *valueOf(E2.MANAGER)* = M.NAME

23

Query Processing – System-R Optimization is cost-based

Consider both disk cost (primary) and CPU cost (scaled by some H)

Compare clustered, non-clustered indices; sequential scan Needed to consider cases where data was interspersed

with other relations

Every join is binary; query plans are compiled Merge join and nested loops join are present Also have a link-based join No dynamic programming yet – Selinger was not yet

aboard

24

Views INGRES:

Not supported at that time (though later) System-R:

Single-table views are typically updatable, if one-to-one (contrast with today’s SQL) Additionally, cursors could typically be modified!

Views as a security/encapsulation mechanism GRANT and REVOKE privileges

25

Triggers Note that System-R had triggers from the

beginning! Later work led to the idea of active

databases, which had very rich triggger languages

ACIDity

27

Rollback / Abort Notion of transactions in System-R

encompassed multiple operations BEGIN_TRANS, END_TRANS, SAVE, RESTORE

Both systems had a notion of “old” and “new” pages (“shadow paging”) Could roll back transactions by swapping back

the old page But how did that work with concurrency?

28

Concurrency & Locking INGRES:

Supports single QUEL statements as transactions Query locked all resources (table-level) before it began All page updates are atomic via locks Avoid deadlocks by preventing cases where they could

occur – this is why they don’t do true ISAM System-R:

Multiple levels of lock granularity logical: table-, range-level physical: page-, tuple-level

Shared, exclusive locks Multiple isolation levels (READ UNCOMMITTED, READ

COMMITTED, SERIALIZABLE; no REPEATABLE READ) Resolve deadlocks by choosing a victim, restarting

29

Recovery INGRES:

Deferred updates – for performance, isolation Also used to make recovery possible But how far does this take you?

System-R: Notion of checkpoints and restarting Transaction logging and replay Not much mentioned about possibility of

recovery from failed restart Today’s techniques even recover from that

Analysis

31

What Ideas from this WorkWere Broken? Shadow paging

Now everything is purely log-based SQL idiosyncrasies INGRES recovery Relations as files Query optimization

32

Discussion:What Ideas Are We Still Using?

Documents

Archetypal Databases, or, What Is a DBMS? Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 9, 2008