Page 1: ICS 224: Database Management Systems  Spring 2011

ICS 224: Database Management Systems Spring 2011

Professor Sharad Mehrotra

Information and Computer Science Department

University of California, Irvine

Page 2: ICS 224: Database Management Systems  Spring 2011

ICS214A Notes 01 2

Course General Info

• URL: http://www.ics.uci.edu/~cs224/
  – All course info will be posted online

• Lecture times: Tue/Thu 5:00 – 6:30

• Instructor: Sharad Mehrotra, BH 2082, [email protected]

• Office Hours: on request

Page 3: ICS 224: Database Management Systems  Spring 2011

Prerequisites

• Basic Data Management Concepts:
  – DB design, relational model, SQL, database programming (CS 122 or equivalent)
  – Database system implementation: indexing, query optimization, query processing, storage management, etc. (ICS 222 or equivalent)

• Basic Computer Science Concepts:
  – Depth-first search, directed/undirected graphs, "big O" notation, computational complexity, NP-completeness, …

Page 4: ICS 224: Database Management Systems  Spring 2011

Course Requirements

• Class Participation: 50%
  – Attendance, presentations, comments, interaction, enthusiasm, etc.

• Class Projects: 50%
  – Implementation oriented: take an idea/topic, identify a project, get it okayed by the instructor, develop a demonstration.
  – Survey of an area: an in-depth survey in the style of Computing Surveys articles. Provide your own perspective on a subarea.
  – MUST commit to a project by the end of the 2nd week.

Page 5: ICS 224: Database Management Systems  Spring 2011

Class Structure

• Each week we will
  – Pick a topic
  – Identify 1 paper per student/group of 2 students
  – Choose 2 papers as lead papers for presentation (one for each class); the others are presented as short presentations

• Each week
  – Start with an overview
  – Lead paper presentation
  – Short presentations of the other papers (main idea)
  – Discussions

Page 6: ICS 224: Database Management Systems  Spring 2011

This course …

Most important ideas in data management (instructor’s pick)

But with the eye towards an end application …

Sentient spaces

Page 7: ICS 224: Database Management Systems  Spring 2011

Sentient Spaces …

• Spaces in which sensors are used to capture the dynamic, evolving state, which is then analyzed for implementing adaptations.

• Numerous examples …
  – intelligent transportation systems
  – reconnaissance
  – surveillance systems
  – smart buildings
  – smart grid ...

Page 8: ICS 224: Database Management Systems  Spring 2011

Example: Smart Video Surveillance

[Figure: CS Building at UC Irvine. Components labeled Video collection, Surveillance Video Database, Semantic Extraction, Event Database, and Query / Analysis.]

Page 9: ICS 224: Database Management Systems  Spring 2011

Implications of Sentient Space focus ..

• Class focuses on topics which you might need to know if you wanted to explore applications in sentient spaces …

• Projects should target something about sentient spaces …
  – E.g., data cleaning of sentient data, a data model to represent sentient spaces, …

Page 10: ICS 224: Database Management Systems  Spring 2011

Data Models (2 weeks)

  – Representing time: TSQL2
  – Representing space
  – Querying streaming data: CQL, ASQL
  – Semi-structured data: OEM, Lore

Page 11: ICS 224: Database Management Systems  Spring 2011

New Ideas in Storage & Indexing (2 weeks)

• New storage models
  – Key-Value store
  – Bigtable
  – Column Stores

• New database system architecture
  – Data outsourcing
  – Multitenant databases

• New indexing techniques
  – Correlation maps

Page 12: ICS 224: Database Management Systems  Spring 2011

Data Quality (2 weeks)

• Data quality issues
  – Inaccuracy, incompleteness, ambiguity, errors, …

• Two aspects:
  – Techniques to improve quality
    – Exploiting contextual knowledge, issues of efficiency
  – Techniques to tolerate poor quality of data in applications.

Page 13: ICS 224: Database Management Systems  Spring 2011

New Computing Architecture (2 weeks)

• MapReduce framework
• Hive
• Pig Latin
• Join processing
• HadoopDB
• Hyrax?

Page 14: ICS 224: Database Management Systems  Spring 2011

Data Privacy (2 weeks)

• Use cases
  – Data publishing, queries, sharing, data outsourcing.

• Diverse criteria
  – Differential privacy, anonymity, l-diversity, ..

• Mechanisms to implement

Page 15: ICS 224: Database Management Systems  Spring 2011

A walk down the history of data models …

Two papers (MUST READ)
• Inclusion of New Types in Relational Data Base Systems, Stonebraker
• The POSTGRES Next-Generation Database Management System, Stonebraker

Page 16: ICS 224: Database Management Systems  Spring 2011

The Paleolithic Period …

• There were no general-purpose tools for managing large volumes of data…
  – OS provided resource management
  – Data was stored in files
  – Applications performed data management functionalities
    – Fault-tolerance
    – Concurrency control
    – Reliability
    – Optimizations …
  – Such functionalities had to be re-implemented for each application

Page 17: ICS 224: Database Management Systems  Spring 2011

The Neolithic Period…

• Early file systems evolve into general-purpose data management tools.

• DBMS Goals:
  – Efficiency and scalability (faster than files)
  – Management of large heterogeneous types of structured data
  – High reliability
  – Information sharing (multiple users)

• DBMS Users:
  – E-commerce companies, banks, airlines, transportation companies, corporate databases, government agencies, …
  – Anyone you can think of!

Page 18: ICS 224: Database Management Systems  Spring 2011

The Dark Ages ….

• Network & hierarchical data models
  – Resulted in data spaghetti
  – Applications needed to chase pointers
  – There was little data abstraction or separation of concerns
    – little difference between physical data representation and logical data representation
  – Optimization was left entirely to application writers
  – There were no clean data management languages
    – Unless you are a Cobol fan!

Page 19: ICS 224: Database Management Systems  Spring 2011

The Relational Era…

• Relational model proposed by Codd
  – Everything is a relation
  – A query consists of an algebraic composition of a few powerful operators
  – Equivalent to a first-order relational calculus

• Primary features
  – Simple, clean data representation with a solid mathematical basis
  – Data abstraction: users do not need to be concerned about how data is stored physically
  – Simple declarative query language: users specify what to compute, not how to compute it
  – Optimization by the system
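
To make "algebraic composition of a few powerful operators" concrete, here is a minimal Python sketch. It assumes relations are modeled as lists of dicts, and the `emp` and `dept` relations are invented sample data; none of this comes from the slides.

```python
# Assumption for illustration: a relation is a list of dicts (one dict per tuple).
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def join(left, right, attribute):
    """Equi-join on one shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]

# Hypothetical sample data.
emp = [
    {"name": "Ann", "dept": "Sales", "salary": 90},
    {"name": "Bob", "dept": "Sales", "salary": 40},
    {"name": "Cy", "dept": "Lab", "salary": 70},
]
dept = [{"dept": "Sales", "floor": 1}, {"dept": "Lab", "floor": 2}]

# A query is a composition of operators: project(select(join(...))).
well_paid = project(
    select(join(emp, dept, "dept"), lambda r: r["salary"] > 50),
    ["name", "floor"],
)
```

The caller states what to compute; a real system would be free to reorder the operators (for instance, applying the selection before the join) without changing the answer.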

Page 20: ICS 224: Database Management Systems  Spring 2011

Data Wars (1)

• Codasyl versus relational debates began…
  – Heated arguments during early SIGMOD conferences
  – Codasyl: the relational model is too simple; applications built using it will never scale in performance.
  – Relational: network/hierarchical models have no formal basis, are too complex, and become unmanageable as application complexity increases.

• Relational model found many supporters
  – Especially at universities
  – Its simplicity was enticing

Page 21: ICS 224: Database Management Systems  Spring 2011

Data Wars (2)

• Many projects started off trying to implement a relational DBMS
  – System R @ IBM Almaden
  – Ingres @ Berkeley
  – These early systems led to the technologies that drive modern data management

• Early prototypes became products
  – DB2 & Ingres

• Principal designers from both the System R and Ingres teams left to start companies
  – Oracle, Sybase

• Early relational companies went door to door converting industry to the relational model
  – Industry got hooked on the simplicity of writing complex applications in the relational model
  – Boeing among the first converts

Page 22: ICS 224: Database Management Systems  Spring 2011

Pointers Strike Back…

• Complex objects in emerging DBMS applications cannot be effectively represented as records in the relational model.

• Representing information in RDBMSs requires complex and inefficient conversion between the relational model and the application programming language.

• ODBMSs provide a direct representation of objects to DBMSs, overcoming the impedance mismatch problem.

[Figure: with an RDBMS, application data structures reach the relational representation through copy and translation; with an ODBMS, data transfer is transparent.]

Page 23: ICS 224: Database Management Systems  Spring 2011

Object Model

• Object:
  – observable entity in the world being modeled
  – similar in concept to an entity in the E/R model

• An object consists of:
  – attributes: properties built from primitive types
  – relationships: properties whose type is a reference to some other object or a collection of references
  – methods: functions that may be applied to the object.
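
The three parts of an object can be illustrated with a short Python sketch. The `Employee` and `Department` classes and their fields are hypothetical names chosen for this example, not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Department:
    name: str                                   # attribute (primitive type)
    members: List["Employee"] = field(default_factory=list)  # relationship: collection of references

@dataclass
class Employee:
    name: str                                   # attribute
    salary: float                               # attribute
    dept: Optional[Department] = None           # relationship: reference to another object

    def raise_salary(self, percent: float) -> float:
        """Method: a function that may be applied to the object."""
        self.salary *= 1 + percent / 100
        return self.salary

lab = Department("Lab")
ann = Employee("Ann", 100.0, lab)
lab.members.append(ann)
```

Navigation follows references directly (`ann.dept.name`, `lab.members[0]`) rather than joining on key values.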

Page 24: ICS 224: Database Management Systems  Spring 2011

Object Oriented Databases

• Evolved as persistent object-oriented programming languages:

• Start with an OO language (e.g., C++, Java, Smalltalk) which has a rich type system

• Add persistence to the objects in the programming language; persistent objects are stored in the database

Page 25: ICS 224: Database Management Systems  Spring 2011

Persistent Programming Languages

• Single programming language for application and data management

• Update to a persistent variable results in an automatic update to the database.

• Persistent data could be of types such as sets, lists, and arrays.

• Application can follow pointers (OIDs) to navigate through data.

[Examples garbled in extraction: an update to a variable i, an assignment involving array elements a[j] and a[j 1], and a path over Employee, Spouse, and benefit_level.]

Page 26: ICS 224: Database Management Systems  Spring 2011

Persistence

• Objects created may have different lifetimes:
  – transient: allocated memory is managed by the programming language run-time system. E.g.,
    – local variables in procedures have the lifetime of a procedure execution
    – global variables have the lifetime of a program execution
  – persistent: allocated memory and storage are managed by the ODBMS run-time system.

• Classes are declared to be persistence-capable or transient.

• Different languages have different mechanisms to make objects persistent:
  – creation time: object declared persistent at creation time (e.g., in the C++ binding); the class must be persistence-capable
  – persistence by reachability: object is persistent if it can be reached from a persistent object (e.g., in the Java binding); the class must be persistence-capable
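
Persistence by reachability amounts to a graph traversal from the persistent roots. The sketch below assumes objects are identified by string ids and that `references` maps each id to the ids it points to; both are illustrative simplifications, not how any particular ODBMS binding stores this information.

```python
def reachable_from(roots, references):
    """Return the ids of all objects reachable from the persistent roots."""
    persistent, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid in persistent:
            continue                      # already visited (handles cycles)
        persistent.add(oid)
        stack.extend(references.get(oid, ()))
    return persistent

# Hypothetical object graph: "temp" points into the persistent graph,
# but nothing persistent points to it, so it stays transient.
references = {
    "db_root": ["emp1"],
    "emp1": ["dept1"],
    "dept1": ["emp1"],
    "temp": ["emp1"],
}
persistent = reachable_from(["db_root"], references)
```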

Page 27: ICS 224: Database Management Systems  Spring 2011

Persistent Object-Oriented Programming Languages

• Persistent objects are stored in the database and accessed from the programming language.

• Single programming language for applications as well as data management.
  – Avoids having to translate data between the application programming language and the DBMS
    – efficient implementation
    – less code
  – Programmer does not need to write explicit code to move data to and from the database
    – persistent objects look exactly the same to the programmer as transient objects
    – the system automatically moves objects between memory and the storage device (pointer swizzling)

Page 28: ICS 224: Database Management Systems  Spring 2011

Approaches To Persistent Programming

• Persistent Virtual Memory
  – Disk representation and memory representation of data are identical.
  – No cost to translate data from one representation to another: efficient!
  – DB size limited to the address space
    – 32-bit processor: 2^32 byte addressability (4 GBytes)
  – Differentiating persistent objects from non-persistent objects is difficult.
  – Difficult to optimize disk layout and locality of access.
  – Example system using this approach: ObjectStore.

Page 29: ICS 224: Database Management Systems  Spring 2011

Approaches To Persistent Programming Languages

• Store persistent objects in files
  – Objects are brought to memory on demand.
  – Implementation of OIDs is complex since pointers do not suffice in general.
    – If the object is in memory, a pointer can be used as the OID.
    – If the object is on disk, a disk address is still not good as an OID since storage can be reorganized. A separate mechanism is needed.
    – Pointer swizzling for efficiency.
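
A rough sketch of the idea in Python: references are stored as OIDs, objects are fetched on first use, and the OID in the referring field is then overwritten with a direct in-memory reference (swizzled). The `Ref` and `ObjectCache` names, and the use of a dict to stand in for files on disk, are assumptions made for this illustration.

```python
class Ref:
    """An unswizzled reference: it only carries an OID."""
    def __init__(self, oid):
        self.oid = oid

class ObjectCache:
    def __init__(self, disk):
        self.disk = disk        # OID -> dict of fields; stands in for files on disk
        self.resident = {}      # OID -> in-memory object
        self.loads = 0          # number of simulated disk reads

    def fetch(self, oid):
        """Bring an object into memory on demand."""
        if oid not in self.resident:
            self.loads += 1
            self.resident[oid] = dict(self.disk[oid])   # copy from "disk"
        return self.resident[oid]

    def follow(self, obj, field_name):
        """Dereference a field, swizzling the OID into a direct pointer."""
        value = obj[field_name]
        if isinstance(value, Ref):
            value = self.fetch(value.oid)
            obj[field_name] = value     # later accesses skip the OID lookup
        return value

disk = {1: {"name": "Ann", "mgr": Ref(2)}, 2: {"name": "Bob", "mgr": None}}
cache = ObjectCache(disk)
ann = cache.fetch(1)
boss = cache.follow(ann, "mgr")
boss_again = cache.follow(ann, "mgr")
```

Because the on-disk copy keeps the OID, storage can be reorganized without invalidating references; only the in-memory copy holds a raw pointer.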

Page 30: ICS 224: Database Management Systems  Spring 2011

Challenges In Building Persistent Languages

• Efficient caching of objects in client address space.
  – Cache coherence.

• In an OODB, data migrates to clients, unlike relational client-server systems where the query migrates to the server.

• Given a large number of clients, each with its own cache of objects, ensuring consistency of objects across multiple clients is a challenge.

Page 31: ICS 224: Database Management Systems  Spring 2011

Disadvantages of ODBMS Approach

• Low protection
  – Since persistent objects are manipulated directly from applications, there are more chances that errors in applications can violate data integrity.

• Non-declarative interface:
  – difficult to optimize queries
  – difficult to express queries

• But …..
  – Most ODBMSs offer a declarative query language, OQL, to overcome the problem.
  – OQL is very similar to SQL and can be optimized effectively.
  – OQL can be invoked from inside the ODBMS programming language.
  – Objects can be manipulated both within OQL and the programming language without explicitly transferring values between the two languages.
  – OQL embedding maintains the simplicity of the ODBMS programming language interface and yet provides declarative access.

Page 32: ICS 224: Database Management Systems  Spring 2011

The Return of the Relations … POSTGRES

• Relational model evolved into ORDBMSs that include the "best of" object-oriented concepts

• Among the first ORDBMS prototypes, built @ Berkeley:

    POSTGRES --(commercialized)--> Illustra --(bought by)--> Informix IUS

• Has had a major impact on the major commercial DBMSs, which have all migrated to the ORDBMS model.

• SQL3, supported by modern databases, adopted many of the concepts developed in Postgres

Page 33: ICS 224: Database Management Systems  Spring 2011

POSTGRES: Combinations

• Introduced object orientation into relational DBMSs.

• Fundamental Concepts.
  – Each record has an OID.
  – Access to data through:
    – query language POSTQUEL.
    – navigation through OIDs.
  – Classes:
  – Inheritance:
  – Types: rich set of types available for columns.
  – Functions: can be called within POSTQUEL.

Page 34: ICS 224: Database Management Systems  Spring 2011

Classes And Inheritance

• Class analogous to relation

• User can create a new class

    create Emp (name = c12, salary = float, age = int)

• Classes can inherit from others

    create Salesman (quota = float) inherits Emp

• Multiple inheritance permitted. If a new class causes ambiguity, it is not created.

• Classes:
  – real: base classes or relations
  – derived: views
  – version: maintained differentially compared to the parent class

Page 35: ICS 224: Database Management Systems  Spring 2011

Types In POSTGRES

• Standard base types
  – float, int, character strings, etc.
  – Abstract data type (ADT) facility to create new base types, e.g.,

      create type point (x = int, y = int)
      create type polygon

• ADTs can be used in class definitions.

    create Dept (dname = c10,
                 mgr = c12,
                 floorspace = polygon,
                 mailstop = point)

Page 36: ICS 224: Database Management Systems  Spring 2011

Functions In POSTGRES

• Three types:
  (1) C functions
  (2) Operators
  (3) POSTQUEL functions

• C functions
  – Any C function over base types or composite types

      retrieve (Dept.name)
      where area (Dept.floorspace) > 500

      retrieve (Emp.name)
      where overpaid (Emp)

  – overpaid (Emp) is a function over a class, i.e., a method

Page 37: ICS 224: Database Management Systems  Spring 2011

Operators

• Arbitrary C functions are not optimized by query optimizers.
  – Special functions, called operators, can utilize indexes for their evaluation.

• Operator: function with 1 or 2 operands

    retrieve (Dept.name)
    where Dept.floorspace AGT "(0,0), (1,1), (0,2)"

  – AGT = Area Greater Than

• A properly defined index (e.g., B-tree) can be used to speed up the evaluation of operators such as AGT.

Page 38: ICS 224: Database Management Systems  Spring 2011

Other Features Of POSTGRES

• Allowed creation of new indices by the user.

• To an extent pioneered the approach of extensible database technology, which is prevalent with vendors today

• Supported transitive closure in queries.

    retrieve* into answer (parent.older)
    from a in answer
    where parent.younger = "John"
    or parent.younger = a.older

• Supported rules
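
The transitive-closure query above keeps re-running until the answer stops growing. A minimal Python sketch of that fixpoint iteration follows; the `parent` pairs are invented sample data, and this is an illustration of the semantics, not of how POSTGRES evaluated `retrieve*`.

```python
def ancestors(parent_pairs, person):
    """Iterate to a fixpoint: add every 'older' whose 'younger' is the
    starting person or is already in the answer."""
    answer = set()
    changed = True
    while changed:
        changed = False
        for older, younger in parent_pairs:
            if (younger == person or younger in answer) and older not in answer:
                answer.add(older)
                changed = True
    return answer

# Hypothetical (older, younger) pairs.
parent = [("Mary", "John"), ("Sue", "Mary"), ("Tom", "Sue"), ("Al", "Bob")]
johns_ancestors = ancestors(parent, "John")
```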

Page 39: ICS 224: Database Management Systems  Spring 2011

POSTQUEL Functions

• Any collection of commands in POSTQUEL.
  – query = POSTQUEL function.

    define function high-pay
    returns Emp as
    retrieve (Emp.all)
    where Emp.salary > 50k

• POSTQUEL function with parameters.

    define function Sal-lookup (c12)
    returns float as
    retrieve (Emp.salary)
    where Emp.name = $1

• Usage of a POSTQUEL function

    retrieve (Emp.name)
    where Emp.salary = Sal-lookup ("Joe")

Page 40: ICS 224: Database Management Systems  Spring 2011

Composite Types In POSTGRES

• POSTQUEL:
  – Composite types are accessed via path expressions, using nested dot notation.

      retrieve (Emp.mgr.age)
      where Emp.name = "joe"

• Prevents having to specify a join.

Page 41: ICS 224: Database Management Systems  Spring 2011

Composite Types In POSTGRES

• Attributes can have a class name as a type, resulting in complex objects with structure.

    create Emp (name = c12,
                salary = float[12],
                age = int,
                mgr = Emp,
                coworker = Emp)

  – An attribute of type Emp refers to 0 or more references of the Emp class.

• A set type that can hold elements of any class.

    add to Emp (hobbies = set)

  – The elements could be of any class.

Page 42: ICS 224: Database Management Systems  Spring 2011

Types In POSTGRES

• Array type (constructor)

    create Emp (name = c12,
                salary = float[12],
                age = int)

  – salary holds the salary for each month.

• POSTQUEL query (array usage in a query)

    retrieve (Emp.name)
    where Emp.salary[4] = 1000

Page 43: ICS 224: Database Management Systems  Spring 2011

Database Technology Matrix

                      Data types
    Query support     Simple          Complex
    YES               RDBMSs          ORDBMSs
    NO                File System     OODBMSs

Page 44: ICS 224: Database Management Systems  Spring 2011

XML & RDF - the new revolution

• Just when the relational model had driven out object-oriented database technology, the WWW led to the proliferation of semi-structured data.

• 2 approaches to supporting XML/RDF
  – Extend relational technology to support XML/RDF
  – Native XML databases

Page 45: ICS 224: Database Management Systems  Spring 2011

Summary of Evolution of Data Model

• The Dark Ages: network & hierarchical models

• Victory of simplicity and beauty over data spaghetti: the relational DBMS

• The pointers strike back -- Object-Orientation, OODBMSs

• The return of the relations -- ORDBMS -- took the best of the OO concepts and incorporated them in the relational model.

• The current and near future -- support for XML & RDF

• The final frontier -- anyone's guess!

Page 46: ICS 224: Database Management Systems  Spring 2011

Key Data Management Technologies (quick review)…

Page 47: ICS 224: Database Management Systems  Spring 2011

Key Database Technologies

• File Management
  – provides a file abstraction as a collection of records stored on disk

• Index Management and Access Methods
  – implements techniques for associative access to data

• Query Optimization and Processing
  – given a query and data storage structures, determines an efficient strategy to evaluate the query.

• Transaction Management
  – ensures consistency of the database in the presence of concurrent transactions and various types of failures

• Catalog Management
  – maintains database schema information

• Authorization and Integrity Management
  – tests for integrity constraints and user authorization

Page 48: ICS 224: Database Management Systems  Spring 2011

Database Management System Architecture

[Figure: applications, queries, and schema changes enter the query processor (compilers, optimizer, evaluator); the storage manager contains the transaction manager, buffer manager, and file system; on disk are the database and indices, and the metadata and data dictionary.]

Page 49: ICS 224: Database Management Systems  Spring 2011

Storage Media and their Properties

• Main Memory
  – costs $100/Mbyte -- reduces every year
  – 'volatile' -- does not survive system failures
  – random I/O very fast
  – data can be processed by the CPU directly
  – capacity limited to orders of magnitude lower than what a database needs.

• Magnetic Disk
  – costs $0.50/Mbyte -- reduces each year
  – non-volatile (except when the disk crashes)
  – random I/O not as fast
  – CPU cannot directly process data. Data needs to be transferred to main memory

• Tape
  – Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival.

Page 50: ICS 224: Database Management Systems  Spring 2011

Databases and Storage Devices

• Due to capacity, cost, and volatility factors, databases are traditionally stored on disks.

• Data is brought to main memory from disks for processing

• There are many ways to interface memory with disk-resident data

• E.g., virtual memory:
  – VM size limited to the max address generated by the CPU
  – Existing VM does not support durability

• File system provides a more powerful mapping between memory and disk storage

• A bunch of tricks are used to ensure that the high latency of secondary storage does not impact application response time and system throughput
  – access disks asynchronously with active applications
  – prefetch data before the application needs it
  – intelligent caching techniques

Page 51: ICS 224: Database Management Systems  Spring 2011

Functional Abstraction of a Simplistic DBMS

[Figure: layered DBMS. Transactions (beginT, SQL, SQL, endT) send SQL statements to the query processor and optimizer, which produce an access plan; the record-oriented file system reads and writes records and scans relations; the buffer manager gets the page containing the tuples; the basic file system reads/writes file pages; the hardware sits at the bottom.]

Page 52: ICS 224: Database Management Systems  Spring 2011

Basic File System

• Provides the abstraction of a file, where a file is an array of fixed-size blocks

• Hides the disk geometry (cylinders, tracks, sectors, slots, and other functional components like arms, heads, etc.) so that programs do not need to deal with these complexities

• Operations supported:
  – create a file
  – delete a file
  – open a file
  – close a file
  – extend a file
  – read (set of) file blocks into buffers in memory
  – write (set of) file blocks

Page 53: ICS 224: Database Management Systems  Spring 2011

Basic File System Design Issues

• File allocation: how to allocate blocks on disk to a file.
  – Contiguous allocation: file stored in contiguous disk blocks. Blocks for storing the file are found using a best-fit, worst-fit, or first-fit policy.
    – +ve: provides fast sequential scan of the file
    – -ve: fragmentation, difficult to enlarge files
  – Linked allocation: file is a linked list of disk blocks
    – +ve: prevents fragmentation, easy to enlarge files
    – -ve: slow for both sequential and random access
  – Index allocation: file implemented using fixed-size blocks pointed to by an index (e.g., B-tree). Popularized by Unix
    – +ve: good random access, easy enlargement, no fragmentation
    – -ve: poor sequential access performance
  – Extent-based allocation: file is a collection of clusters of consecutive disk blocks (extents), where the collection is maintained using linked lists or an index
    – Most popular approach with vendors.

• Free space management: information about which blocks are free

Page 54: ICS 224: Database Management Systems  Spring 2011

Buffer Management

• Makes file pages addressable in memory and coordinates writing of pages to disk with other components to guarantee transactional properties

• Acts as a mediator between the basic file system and the record-oriented file system

• Buffer frames are maintained in main memory. When a request for file page access comes, check if the page is in the buffer. Else get a free frame and load the file page into the buffer

• Operations supported:
  – bufferfix
  – bufferunfix
  – get block
  – flush
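
The fix/unfix/flush cycle can be sketched in Python as below. This is a toy under stated assumptions: the `BufferManager` class, its LRU eviction policy, and the `read_page`/`write_page` callbacks standing in for the basic file system are all hypothetical, and nothing here implements logging or transactional guarantees.

```python
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, read_page, write_page):
        self.capacity = capacity
        self.read_page = read_page      # callback standing in for the basic file system
        self.write_page = write_page
        self.frames = OrderedDict()     # page_id -> [data, pin_count, dirty]; order = LRU

    def fix(self, page_id):
        """Pin a page in a frame and return a reference to its data (not a copy)."""
        if page_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = [self.read_page(page_id), 0, False]
        self.frames.move_to_end(page_id)            # mark as most recently used
        frame = self.frames[page_id]
        frame[1] += 1
        return frame[0]

    def unfix(self, page_id, dirty=False):
        """Unpin a page; the caller reports whether it modified the page."""
        frame = self.frames[page_id]
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def flush(self, page_id):
        """Write a dirty page back to storage."""
        frame = self.frames[page_id]
        if frame[2]:
            self.write_page(page_id, frame[0])
            frame[2] = False

    def _evict(self):
        """Evict the least recently used unpinned page."""
        victim = next((p for p, f in self.frames.items() if f[1] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        self.flush(victim)
        del self.frames[victim]

# Hypothetical three-page "disk" and a two-frame buffer pool.
disk = {1: b"a", 2: b"b", 3: b"c"}
writes = []
bm = BufferManager(2, lambda p: bytearray(disk[p]), lambda p, d: writes.append((p, bytes(d))))

page = bm.fix(1)
page[0:1] = b"z"           # update in place through the returned reference
bm.unfix(1, dirty=True)
bm.fix(2)                  # stays pinned
bm.fix(3)                  # pool is full: page 1 is written back and evicted
```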

Page 55: ICS 224: Database Management Systems  Spring 2011

Database Buffer Management Design Issues

• DBMS buffer manager returns a pointer to the frame containing the data instead of returning a copy of the requested page to the caller.
  – Efficiency: prevents unnecessary copying of data
  – Allows sharing of data at a finer granularity than a page
    – Example: two transactions T1 and T2 update records r1 and r2 on the same page. If the buffer manager allowed applications to copy data to their address space and rewrite updated versions, updates might be lost.

• Database buffer manager participates in protocols to implement transactions (WAL, FL@C, pinning buffer slots)

• Novel page replacement strategies:
  – The traditional LRU strategy used in OSs works well only under the assumption of locality of reference, which may not hold for DBMSs
  – Since DBMS query languages are declarative, the system has much more information about reference patterns, which it can exploit to improve the caching performance of the buffer manager

Page 56: ICS 224: Database Management Systems  Spring 2011

Record-Oriented File System

• Provides the abstraction of a file as a collection of records.

• Records can be:
  – fixed size or variable length
  – short, long, or very long
  – attributes can be fixed length or variable length
  – simple or complex (e.g., containing set-valued attributes)

• Operations supported:
  – create, delete, open, close, alter, drop
  – read, insert, update, delete record
  – scan all records in a file

• Issues involved:
  – mapping records to pages
  – file organization: organization of records in a file.
    – where to insert new records
    – what mechanism can be used to retrieve records

Page 57: ICS 224: Database Management Systems  Spring 2011

Index Management and Associative Access

• Associative access: accessing records based on their attribute values.

• Index Files
  – An index file declared over a (set of) attribute(s) of the data file provides associative access to records in the data file.
  – The index file contains pointers to the disk blocks where the records corresponding to a value appear.

• Types of an index (let the indexing attribute be A):
  – primary: A is a key and the data file is stored sorted on A
  – clustered: A is not a key but the data file is stored sorted on A
  – secondary (key): A is a key but the data file is not sorted on A
  – secondary (non-key): A is not a key and the data file is not sorted on A.

Page 58: ICS 224: Database Management Systems  Spring 2011

Organization of Index File

• B-tree Index: index file is organized as a B-tree
  – Advantages:
    – Supports range searches efficiently. E.g., retrieve all employees with salary between 100K and 200K
    – Guaranteed good storage utilization
  – Disadvantages:
    – Searching for a given record could take around 3-4 disk I/Os

• Hash Index: index file maintained as a hash file.
  – Advantages:
    – Looking for a specific record is very efficient -- 1 disk I/O
  – Disadvantages:
    – Cannot support range searches

• Multidimensional Access Methods
  – Modern databases are beginning to support novel data structures like R-trees, grid files, and inverted lists to better serve emerging application requirements
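
The trade-off between the two organizations shows up even in a toy Python version. In this sketch a dict stands in for a hash index and a sorted list (searched with `bisect`) stands in for the ordered leaf level of a B-tree; the `records` data is invented, and neither structure models disk pages or I/O counts.

```python
import bisect

# Hypothetical (name, salary in K) records; the list position is the record id.
records = [("Ann", 150), ("Bob", 90), ("Cy", 120), ("Dee", 210)]

# Hash index on salary: exact-match lookup only.
hash_index = {}
for rid, (_, salary) in enumerate(records):
    hash_index.setdefault(salary, []).append(rid)

# Ordered index on salary: entries are kept sorted, so ranges are contiguous.
sorted_index = sorted((salary, rid) for rid, (_, salary) in enumerate(records))

def range_search(lo, hi):
    """Names of all records with lo <= salary <= hi, in salary order."""
    start = bisect.bisect_left(sorted_index, (lo, -1))
    end = bisect.bisect_right(sorted_index, (hi, len(records)))
    return [records[rid][0] for _, rid in sorted_index[start:end]]

exact = [records[rid][0] for rid in hash_index.get(120, [])]
in_range = range_search(100, 200)
```

Answering the range query with the hash index alone would require probing every possible salary value or scanning all entries, which is why it "cannot support range searches".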

Page 59: ICS 224: Database Management Systems  Spring 2011

Multidimensional Indexing Motivation

• Many applications of databases are geographical (2-d data). Others involve a large number of dimensions

• Examples:
  – Location of restaurants in a city.
  – Map data: zones, county lines, rivers, lakes, etc. (data has spatial extent)
  – Sales information described by store, day, item, color, size, etc. Sale = point in multidimensional space.
  – Student described by age, zipcode, marital status.

• Queries:
  – Range query: "find all McDonald's restaurants within a given region".
  – Nearest neighbor query: find the nearest McDonald's to my house
  – Partial match queries

Page 60: ICS 224: Database Management Systems  Spring 2011

Approach: Utilize Single Dimensional Index

• Index on attributes independently
• Project the query range onto each attribute to determine pointers.
• Intersect pointers
• Go to the database and retrieve the objects in the intersection.

May result in very high I/O cost
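
The steps above can be sketched in Python as follows. The `points` table and the sorted-list indexes are assumptions for illustration; a real system would use one B-tree per attribute and count disk I/Os rather than set sizes.

```python
import bisect

# Hypothetical 2-d points keyed by record id.
points = {0: (1, 9), 1: (4, 4), 2: (5, 6), 3: (8, 5), 4: (6, 1)}

def build_index(dim):
    """One independent, sorted index per attribute."""
    return sorted((coords[dim], rid) for rid, coords in points.items())

def candidates(index, lo, hi):
    """Record ids whose indexed attribute lies in [lo, hi]."""
    start = bisect.bisect_left(index, (lo, float("-inf")))
    end = bisect.bisect_right(index, (hi, float("inf")))
    return {rid for _, rid in index[start:end]}

x_index, y_index = build_index(0), build_index(1)

def range_query(x_lo, x_hi, y_lo, y_hi):
    x_hits = candidates(x_index, x_lo, x_hi)   # project query range on x
    y_hits = candidates(y_index, y_lo, y_hi)   # project query range on y
    return x_hits, y_hits, sorted(x_hits & y_hits)

x_hits, y_hits, result = range_query(4, 6, 4, 6)
```

In this example each index returns three candidates but only two survive the intersection; with selective queries over large data, most of the pointers fetched per attribute are wasted work, which is the high cost the slide warns about.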

Page 61: ICS 224: Database Management Systems  Spring 2011

R-tree Data Structure

• Extension of the B-tree to multidimensional space.

• Paginated, balanced, guaranteed storage utilization.

• Can support both point data and data with spatial extent

• Groups objects into possibly overlapping clusters (rectangles in our case)

• Search for a range query proceeds along all paths that overlap with the query.

Page 62: ICS 224: Database Management Systems  Spring 2011

Split Node

• Given a node, split it into two nodes which are each at least half full

• Multiple objectives:
  – minimize overlap
  – minimize covered area

• R-tree minimizes covered area

• What is an optimal criterion?

[Figure: two example splits, one labeled "Minimize overlap" and one labeled "Minimize covered area".]

Page 63: ICS 224: Database Management Systems  Spring 2011

Minimizing Covered Area

• Group objects into 2 parts such that the covered area is minimized

• NP-hard!

• Hence use heuristics

• Two heuristics explored
  – quadratic and linear
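
As a rough illustration of the quadratic flavor, the Python sketch below picks as seeds the pair of rectangles that would waste the most area if grouped together, then greedily assigns each remaining rectangle to the group whose bounding box grows least. This is a simplification, not the full published algorithm: it skips the rule that picks the next entry by largest preference difference and does not enforce the "at least half full" constraint. Rectangles are assumed to be `(xmin, ymin, xmax, ymax)` tuples.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds(rects):
    """Examine all pairs (quadratic cost); choose the pair with the most wasted area."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def split(rects):
    i, j = pick_seeds(rects)
    groups, covers = [[i], [j]], [rects[i], rects[j]]
    for k, r in enumerate(rects):
        if k in (i, j):
            continue
        # Assign to the group whose covering rectangle needs the least enlargement.
        growth = [area(union(c, r)) - area(c) for c in covers]
        g = 0 if growth[0] <= growth[1] else 1
        groups[g].append(k)
        covers[g] = union(covers[g], r)
    return groups, covers

rects = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (9, 9, 10, 10)]
groups, covers = split(rects)
```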

Page 64: ICS 224: Database Management Systems  Spring 2011

Other Multidimensional Data Structures

• Many generalizations of the R-tree
  – different splitting criteria
  – different shapes of clusters (e.g., d-dimensional spheres)
  – adding redundancy to reduce search cost:
    – store objects in multiple rectangles instead of a single rectangle to reduce the cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap, causing search performance to deteriorate.

• Space Partitioning Data Structures
  – unlike the R-tree, which groups objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions.
  – E.g., KD tree, quad tree, grid files, KD-Btree, HB-tree, hybrid tree.

• Space filling curves
  – superimpose an ordering on multidimensional space that preserves proximity in multidimensional space (Z-ordering, Hilbert ordering)
  – use a B-tree as an index on that ordering

Page 65: ICS 224: Database Management Systems  Spring 2011

KD-tree

• A main memory data structure based on binary search trees
  – can be adapted to the block model of storage (KD-Btree)

• Levels rotate among the dimensions, partitioning the space based on a value for that dimension

• KD-tree is not necessarily balanced.
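
A small Python sketch of a 2-d KD-tree with a range search is given below. It assumes a static point set and builds the tree by splitting at the median, which happens to produce a balanced tree; a tree grown by repeated insertion would not be balanced in general. The dict-based node layout and the sample points are choices made for this example.

```python
def build(points, depth=0):
    """Build a 2-d KD-tree; levels alternate between x (axis 0) and y (axis 1)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),       # coordinates <= split value
        "right": build(points[mid + 1:], depth + 1),  # coordinates >= split value
    }

def range_search(node, lo, hi, found=None):
    """Collect points p with lo[d] <= p[d] <= hi[d] in both dimensions."""
    if found is None:
        found = []
    if node is None:
        return found
    p, axis = node["point"], node["axis"]
    if all(lo[d] <= p[d] <= hi[d] for d in range(2)):
        found.append(p)
    if lo[axis] <= p[axis]:          # query range reaches into the left half-space
        range_search(node["left"], lo, hi, found)
    if p[axis] <= hi[axis]:          # query range reaches into the right half-space
        range_search(node["right"], lo, hi, found)
    return found

pts = [(5, 4), (3, 2), (8, 6), (2, 7), (7, 1), (9, 3)]
tree = build(pts)
hits = sorted(range_search(tree, (4, 0), (9, 4)))
```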

Page 66: ICS 224: Database Management Systems  Spring 2011

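KD-Tree Example

[Figure: a KD-tree with split nodes labeled x=5, y=5, y=6, x=3, y=2, x=8, and x=7, shown next to the corresponding partition of the plane by the lines X=5, X=8, X=7, X=3, Y=2, and Y=6.]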

Page 67: ICS 224: Database Management Systems  Spring 2011

Adapting KD Tree to Block Model

• Similar to the B-tree, tree nodes split many ways instead of two ways
  – Risk:
    – insertion becomes quite complex and expensive.
    – no storage utilization guarantee, since when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.

• Pack many interior nodes (forming a subtree) into a block.
  – Risk:
    – it may not be feasible to group nodes at the lower levels into a block productively.
    – many interesting papers on how to optimally pack nodes into blocks have recently been published.

Page 68: ICS 224: Database Management Systems  Spring 2011


Quad Tree

• Nodes split along all dimensions simultaneously

• Division fixed: by quadrants

• As with KD-tree, we cannot make quadtree levels uniform

Page 69: ICS 224: Database Management Systems  Spring 2011


Quad Tree Example

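[Figure: example quad tree whose nodes split into NW, NE, SW, and SE quadrants, shown with a plane diagram labeled X=3, X=5, X=7, and X=8.]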

Page 70: ICS 224: Database Management Systems  Spring 2011


Grid Files

• Space partitioning strategy, but different from a tree.

• Select dividers along each dimension. Partition space into cells.

• Unlike KD-tree, dividers cut all the way.

• Each cell corresponds to 1 disk page.

• Many cells can point to the same page.

• Cell directory potentially exponential in the number of dimensions
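A minimal sketch of the lookup path, assuming one sorted list of dividers per dimension and a directory that maps each cell to a page id. The class name `GridFile`, the field names, and the page ids are hypothetical.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, dividers, directory):
        self.dividers = dividers    # one sorted list of split values per dimension
        self.directory = directory  # cell index tuple -> disk page id

    def cell(self, point):
        """Locate the cell by binary-searching the dividers of each dimension."""
        return tuple(bisect_right(d, v) for d, v in zip(self.dividers, point))

    def page(self, point):
        """One directory lookup gives the page that holds the point."""
        return self.directory[self.cell(point)]

# Two dimensions: one divider on x, two on y, so 2 * 3 = 6 cells.
grid = GridFile(
    dividers=[[5], [3, 6]],
    directory={
        (0, 0): "P1", (0, 1): "P1", (0, 2): "P2",  # two cells share page P1
        (1, 0): "P3", (1, 1): "P4", (1, 2): "P4",  # two cells share page P4
    },
)
```

With k dividers in each of d dimensions the directory has (k + 1)^d entries, which is the exponential growth the last bullet warns about.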

Page 71: ICS 224: Database Management Systems  Spring 2011


Space Filling Curve

• Assumption
– finite precision in representing each coordinate.

[Figure: 4 x 4 grid with both axes labeled 00, 01, 10, 11, containing objects A, B, and C.]

Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5

Z(B) = 11 = 3 (common prefix to all its blocks)

Z(C1) = 0010 = 2

Z(C2) = 1000 = 8
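The shuffle operation can be sketched as plain bit interleaving. The bit order (x bit first, most significant bits first) is inferred from the example shuffle(00, 11) = 0101, and the explicit `bits` parameter is an assumption standing in for the fixed coordinate precision.

```python
def shuffle(x, y, bits):
    """Interleave the bits of x and y, most significant first, x before y."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# Sorting points by Z-value gives the one-dimensional order that a B-tree
# could then index.
points = [(0b10, 0b00), (0b00, 0b11), (0b01, 0b00)]
in_z_order = sorted(points, key=lambda p: shuffle(p[0], p[1], 2))
```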

Page 72: ICS 224: Database Management Systems  Spring 2011


Deriving Z-Values for a Region

• Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until blocks are homogeneous.

[Figure: quad-tree decomposition of an object, with blocks labeled by their Z-values.]

The object's representation is 0001, 0011, 01
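A minimal sketch of the recursive decomposition, assuming the object is given as a set of grid cells and that each quadrant appends one x bit followed by one y bit to the Z-value prefix. The function name `z_blocks` and the example region are assumptions; the region is a reconstruction chosen so that the result matches the representation given above.

```python
def z_blocks(cells, bits, x0=0, y0=0, prefix=""):
    """Return the Z-value prefixes of the quad-tree blocks that cover `cells`.

    `cells` is a set of (x, y) grid cells; the current block is the square of
    side 2**bits whose lower corner is (x0, y0).
    """
    side = 1 << bits
    block = {(x, y) for x in range(x0, x0 + side) for y in range(y0, y0 + side)}
    inside = block & cells
    if not inside:
        return []          # homogeneous: entirely outside the object
    if inside == block:
        return [prefix]    # homogeneous: entirely inside the object
    half = side >> 1
    out = []
    for xb in (0, 1):      # visit quadrants in Z-order: 00, 01, 10, 11
        for yb in (0, 1):
            out += z_blocks(cells, bits - 1, x0 + xb * half, y0 + yb * half,
                            prefix + str(xb) + str(yb))
    return out

# Hypothetical object: columns 00-01, rows 01-11 of a 4 x 4 grid.
region = {(x, y) for x in (0, 1) for y in (1, 2, 3)}
```

Larger homogeneous blocks get shorter prefixes, so a block's Z-value is a common prefix of the Z-values of every cell it contains.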

Page 73: ICS 224: Database Management Systems  Spring 2011


Generalized Search Trees

• Motivation:
– disparate applications require different data structures and access methods.
– each data structure requires separate code to be integrated with the database code: too much effort. Vendors will not spend time and energy unless the application is very important or the data structure has general applicability.

• Generalized search trees abstract the notion of a data structure into a template.
– Basic observation: most data structures are similar, and a lot of bookkeeping and implementation details are the same.
– Different data structures can be seen as refinements of the basic GiST structure. Refinements are specified by registering a set of functions per data structure with the GiST.

Page 74: ICS 224: Database Management Systems  Spring 2011


GiST supports extensibility both in terms of data types and queries

• GiST is like a “template” - it defines its interface in terms of ADT rather than physical elements (like nodes, pointers etc.)

• The access method (AM) implementer can customize GiST by defining his or her own ADT class; i.e., once you define the ADT class, you have your access method implemented!

• No concern about search/insertion/deletion, structural modifications like node splits etc.
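A minimal sketch of the template idea, not the actual GiST interface. The method names `consistent`, `union`, and `penalty` are borrowed from the original GiST proposal rather than from these slides, a split-picking method is omitted for brevity, and the `Interval` key class and the list-based tree encoding are assumptions made for the example.

```python
from abc import ABC, abstractmethod

class GistKey(ABC):
    """Hypothetical key ADT: the only part an access-method author writes."""

    @abstractmethod
    def consistent(self, query):
        """Could anything stored under this key satisfy the query?"""

    @abstractmethod
    def union(self, other):
        """A key that covers both this key and other."""

    @abstractmethod
    def penalty(self, new):
        """Cost of inserting `new` under this key (used to choose a subtree)."""

class Interval(GistKey):
    """One-dimensional range keys, giving B-tree-like behaviour."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, query):
        return self.lo <= query.hi and query.lo <= self.hi

    def union(self, other):
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def penalty(self, new):
        u = self.union(new)
        return (u.hi - u.lo) - (self.hi - self.lo)

def search(node, query):
    """Generic search written once, relying only on consistent().

    A node is a list of (key, child) pairs; a child is either another node
    (a list) or a stored data item.
    """
    hits = []
    for key, child in node:
        if key.consistent(query):
            if isinstance(child, list):
                hits += search(child, query)
            else:
                hits.append(child)
    return hits
```

Swapping `Interval` for a rectangle key with the same three methods would give R-tree-like behaviour without touching `search`.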

Page 75: ICS 224: Database Management Systems  Spring 2011


Query Processing in DBMSs

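[Figure: query processing pipeline. A query (Select … From … Where …) enters Parsing and Translation, which produces an internal relational-algebra-based representation of the query. The optimizer, using statistics about data, turns this into an optimized execution plan. The evaluation engine runs the plan against the data and index and returns the query results (e.g., Sally 4000, Dick 9000, …).]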

Page 76: ICS 224: Database Management Systems  Spring 2011


Query Optimization

• Goal: to find the cheapest evaluation strategy for a query

• Stages of optimization:
– Algebraic manipulations: heuristics used to convert the query tree into an equivalent but more efficient representation.
  perform selections and projections as early as possible.
  combine selections with Cartesian products to make a join.
  combine sequences of unary operations (selections and projections).
  look for common subexpressions in an expression.
– Cost-based analysis: given the optimized representation produced after algebraic manipulation:
  generate all possible query plans and estimate their costs based on the statistical information and the costs of each unary and binary operation.
  The best possible query plan is chosen as the execution strategy.
  The number of plans considered, even after heuristics are applied, is exponential in the number of operators in the query tree.
  It is important to choose a good plan, since the cost of generating the plan is amortized over multiple query executions.
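The cost-based stage can be illustrated with a toy enumerator that tries every left-deep join order and keeps the cheapest. The cost model (sum of estimated intermediate result sizes), the relation sizes, and the selectivities are all assumptions made up for this sketch; real optimizers use richer statistics and prune the search.

```python
from itertools import permutations

def plan_cost(order, sizes, selectivity):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    rows, cost = sizes[order[0]], 0
    joined = [order[0]]
    for rel in order[1:]:
        sel = 1.0
        for prev in joined:
            # A missing entry means no join predicate, i.e. a Cartesian product.
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        rows = rows * sizes[rel] * sel
        cost += rows
        joined.append(rel)
    return cost

def best_plan(sizes, selectivity):
    """Exhaustively enumerate join orders and return the cheapest one."""
    return min(permutations(sizes), key=lambda o: plan_cost(o, sizes, selectivity))

# Hypothetical statistics for three relations R, S, T.
sizes = {"R": 1000, "S": 100, "T": 10}
selectivity = {frozenset(("R", "S")): 0.01, frozenset(("S", "T")): 0.1}
```

Even this restricted plan space has n! orders for n relations, which is why the algebraic heuristics are applied first to narrow the choices.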

Page 77: ICS 224: Database Management Systems  Spring 2011


Cost of Query Execution

• Access to disk: cost of reading, writing, searching data blocks. (i/o cost)

• Storage Costs: cost of storing intermediate files generated during query execution. (i/o cost)

• Computation cost: cost of in memory execution of operations. (cpu cost)

• Communication cost: cost of shipping the query and results from site to site or terminal where query originated. (communication cost)

• Total cost = I/O cost + w1 * CPU cost + w2 * communication cost

• Traditionally I/O cost considered most important

Page 78: ICS 224: Database Management Systems  Spring 2011


Transaction Management

Applications in databases are modeled as transactions, which provide ACID guarantees.

• Atomicity: either all the effects of a transaction appear in database or none of the effects of a transaction appears in database.

• Consistency: each transaction maps a database from consistent state to another consistent state

• Isolation: the concurrent execution of transactions is hidden from each transaction; it runs as if no other transaction were executing at the same time

• Durability: if a transaction completes, its effects are permanent and survive failures.

Page 79: ICS 224: Database Management Systems  Spring 2011


Transaction Model

• Transactions provide a simple, powerful, and natural programming model for writing database applications.

• Transaction concept supports:
– simple failure semantics: either all the effects of a transaction appear in the database or none do (all or nothing)
– isolated view of the world: protection from partial effects of other concurrent applications.

• Transactions allow applications to share data without having to explicitly deal with either fault-tolerance or synchronization

• Transactions are the enabling technology for large distributed applications.

Page 80: ICS 224: Database Management Systems  Spring 2011


Isolation

• Isolation is implemented by using the 2 phase locking protocol

• 2 Phase Locking Protocol:
– Each transaction acquires a lock on a data item before accessing data
– Locks are released when a transaction commits

Example execution (time runs downward):
User 1 reads account = 1500
User 2 reads account = 1500
User 1 sets account value = 500 (withdraws 1000 dollars)
User 2 sets account value = 700 (withdraws 800 dollars)

This execution will be prevented by 2 phase locking, since user 1’s transaction will not release the lock on account until it terminates.
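A minimal, single-threaded sketch of why the interleaving above cannot occur. It assumes exclusive locks only (no shared/exclusive distinction) and models waiting as a refused lock request; `LockManager`, `acquire`, and `release_all` are hypothetical names.

```python
class LockManager:
    def __init__(self):
        self.owner = {}  # data item -> transaction currently holding its lock

    def acquire(self, txn, item):
        """Grant the lock if it is free or already held by txn."""
        return self.owner.setdefault(item, txn) == txn

    def release_all(self, txn):
        """Called at commit: release every lock held by txn."""
        for item in [i for i, t in self.owner.items() if t == txn]:
            del self.owner[item]

locks = LockManager()
account = {"balance": 1500}

t1_granted = locks.acquire("T1", "account")  # User 1 locks, then reads 1500
t2_granted = locks.acquire("T2", "account")  # User 2 is refused and must wait
account["balance"] -= 1000                   # User 1 withdraws 1000
locks.release_all("T1")                      # User 1 commits; lock is released
t2_retry = locks.acquire("T2", "account")    # User 2 now reads 500, not 1500
```

Because user 2 cannot read the balance until user 1 has committed, it never acts on the stale value 1500, and the update that would have set the account to 700 is never made.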

Page 81: ICS 224: Database Management Systems  Spring 2011


Atomicity

• Atomicity is implemented by using a logging strategy.

• A transaction, before updating a data item, writes an undo log record, using which its effects can be undone.

• If a transaction aborts, its undo log records are used to reconstruct the database state before transaction execution.

[Figure: during normal processing, DO maps the old state to the new state and produces an undo log record. On transaction rollback (due to user-requested abort, system failure, or consistency violation), UNDO applies the undo log record to the new state to restore the old state.]
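A minimal in-memory sketch of undo logging, assuming every key exists before it is written and ignoring how the log itself reaches stable storage. The class and method names are hypothetical.

```python
class UndoLoggedDB:
    def __init__(self, data):
        self.data = dict(data)
        self.undo_log = []  # (txn, key, old value) records, in write order

    def write(self, txn, key, value):
        # Record the old value BEFORE overwriting it.
        self.undo_log.append((txn, key, self.data[key]))
        self.data[key] = value

    def abort(self, txn):
        # Undo the transaction's writes, most recent first.
        for t, key, old in reversed(self.undo_log):
            if t == txn:
                self.data[key] = old
        self.undo_log = [rec for rec in self.undo_log if rec[0] != txn]

db = UndoLoggedDB({"account": 1500})
db.write("T1", "account", 500)
db.write("T1", "account", 300)
db.abort("T1")  # restores account to 1500
```

Replaying the records in reverse order matters when a transaction writes the same item more than once, as T1 does here.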

Page 82: ICS 224: Database Management Systems  Spring 2011


Durability

• Durability is implemented using a logging strategy

• A transaction, before updating a data item, writes a redo log record, using which its effects can be redone

• If the system fails before a committed transaction’s effects appear in the database, its effects are redone using redo log records on recovery.

[Figure: during normal processing, DO maps the old state to the new state and produces a redo log record. To redo a committed transaction, REDO applies the redo log record to the old state to produce the new state.]
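A minimal sketch of redo-based recovery, assuming the log is a list of `("write", txn, key, new_value)` and `("commit", txn)` tuples that survived the crash. The record layout and the function name `recover` are assumptions made for this example.

```python
def recover(disk_state, redo_log):
    """Replay the writes of committed transactions onto the on-disk state."""
    committed = {rec[1] for rec in redo_log if rec[0] == "commit"}
    state = dict(disk_state)
    for rec in redo_log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, key, new_value = rec
            state[key] = new_value  # redo in log order
    return state

# T1 committed before the crash but its update never reached disk;
# T2 wrote but did not commit, so its write is not redone.
log = [
    ("write", "T1", "account", 500),
    ("commit", "T1"),
    ("write", "T2", "account", 700),
]
recovered = recover({"account": 1500}, log)
```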