CS 257: Database System Principles Final Exam, by Ronak Shah (214), SJSU ID: 006260709, under the guidance of Dr. Lin
Slide 2
Sections 13.1-13.3, Sanuja Dabade & Eilbroun Benjamin, CS 257,
Dr. T. Y. Lin. SECONDARY STORAGE MANAGEMENT
Slide 3
13.1.1 Memory Hierarchy Data storage capacity varies across devices, and the cost per byte to store data varies as well. The device with the smallest capacity offers the fastest speed, at the highest cost per bit.
Slide 4
Figure: memory hierarchy diagram, with levels labeled: Programs, DBMS; Main-Memory DBMSs; Main Memory; Cache; Virtual Memory; Disk; File System; Tertiary Storage
Slide 5
13.1.1 Memory Hierarchy Cache: the lowest level of the hierarchy. Data items in the cache are copies of certain locations of main memory. Sometimes values in the cache are changed, and the corresponding changes to main memory are delayed. The machine looks in the cache for instructions, as well as for the data those instructions use. The amount of data that can be cached is limited. In a single-processor computer there is no need to update main memory immediately; with multiple processors, data is written to main memory immediately, which is called write-through.
Slide 6
Main Memory Refers to physical memory that is internal to the computer. The word main distinguishes it from external mass storage devices such as disk drives. Everything in the computer (instruction execution, data manipulation, and so on) happens by working on information that is resident in main memory. Main memories are random access: one can obtain any byte in the same amount of time.
Slide 7
Secondary storage Used to store data and programs when they are not being processed. More permanent than main memory: data and programs are retained when the power is turned off. A PC might only require 20,000 bytes of secondary storage. E.g. magnetic disks (hard disks).
Slide 8
Tertiary Storage Consists of one to several storage drives. It is a comprehensive computer storage system that is usually very slow, so it is typically used to archive data that is not accessed frequently. It holds data volumes in terabytes and is used for databases much larger than what can be stored on disk.
Slide 9
13.1.2 Transfer of Data Between Levels Data moves between adjacent levels of the hierarchy. At the secondary and tertiary levels, accessing the desired data or finding the desired place to store it takes a lot of time. A disk is organized into blocks. Entire blocks are moved to and from a region of main memory called a buffer. A key technique for speeding up database operations is to arrange the data so that when one piece of a data block is needed, it is likely that other data on the same block will be needed at the same time. The same idea applies to the other hierarchy levels.
Slide 10
13.1.3 Volatile and Non-Volatile Storage A volatile device forgets what is stored on it when the power goes off. A non-volatile device holds data even when the device is turned off. Secondary and tertiary devices are non-volatile; main memory is volatile.
Slide 11
13.1.4 Virtual Memory A computer-system technique that gives an application program the impression that it has contiguous working memory (an address space), while in fact it may be physically fragmented and may even overflow onto disk storage. The technique makes programming of large applications easier and uses real physical memory (e.g. RAM) more efficiently. Typical software executes in virtual memory. The address space is typically 32 bits, i.e. 2^32 bytes or 4 GB. Transfer between memory and disk is in terms of blocks.
Slide 12
13.2.1 Mechanics of Disks Use of secondary storage is one of the important characteristics of a DBMS. A disk consists of two moving pieces: 1. the disk assembly and 2. the head assembly. The disk assembly consists of one or more platters, which rotate around a central spindle.
Slide 13
13.2.1 Mechanics of Disks A disk is organized into tracks. The tracks that are at a fixed radius from the center form one cylinder. Tracks are organized into sectors: segments of the circle separated by gaps.
Slide 14
Slide 15
13.2.2 Disk Controller One or more disks are controlled by a disk controller. Disk controllers are capable of: controlling the mechanical actuator that moves the head assembly; selecting the sector from among all those in the cylinder at which the heads are positioned; transferring bits between the desired sector and main memory; and possibly buffering an entire track.
Slide 16
13.2.3 Disk Access Characteristics Accessing (reading/writing) a block requires three steps. The disk controller positions the head assembly at the cylinder containing the track on which the block is located: this is the seek time. The disk controller then waits while the first sector of the block moves under the head: this is the rotational latency. Finally, all the sectors and the gaps between them pass under the head while the disk controller reads or writes the data in these sectors: this is the transfer time.
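A rough sketch of how the three components combine (an illustration, not from the slides: the 6.46 ms seek and 8.33 ms rotation figures are the Megatron 747 numbers used in the examples below, and the 0.13 ms transfer figure is an assumption chosen so the total matches the 10.76 ms average quoted in Ex 13.2):

# Sketch: block access time = seek time + rotational latency + transfer time.
def block_access_ms(seek_ms, rotation_ms, transfer_ms):
    rotational_latency = rotation_ms / 2   # on average, half a rotation
    return seek_ms + rotational_latency + transfer_ms

# Megatron 747-style figures: 6.46 ms average seek, 8.33 ms per rotation,
# assumed 0.13 ms to transfer one 16K block.
print(block_access_ms(6.46, 8.33, 0.13))   # ~10.76 ms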
Slide 17
13.3 Accelerating Access to Secondary Storage Several approaches for more efficiently accessing data in secondary storage: place blocks that are accessed together in the same cylinder; divide the data among multiple disks; mirror disks; use disk-scheduling algorithms; prefetch blocks into main memory. Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm. Throughput: the number of disk accesses per second that the system can accommodate.
Slide 18
13.3.1 The I/O Model of Computation The number of block accesses (disk I/Os) is a good approximation of the time taken by an algorithm: disk I/Os are proportional to time taken. Ex 13.3: You want an index on R to identify the block on which the desired tuple appears, but not where on the block it resides. For the Megatron 747 (M747) example, it takes 11 ms to read a 16K block; the delay in searching for the desired tuple within the block is negligible.
Slide 19
13.3.2 Organizing Data by Cylinders Ex 13.4: We request 1024 blocks of the M747. If the data is randomly distributed, the average latency is 10.76 ms by Ex 13.2, making the total latency 11 s. If all blocks are stored consecutively on one cylinder: 6.46 ms (one average seek) + 8.33 ms (time per rotation) * 16 (number of rotations) = 139.74 ms. The first seek time and first rotational latency can never be neglected.
Slide 20
13.3.3 Using Multiple Disks The factor by which performance improves is proportional to the number of disks. Striping: distributing a relation across n disks following this pattern: disk 1 holds blocks 1, 1+n, 1+2n, ...; disk 2 holds blocks 2, 2+n, 2+2n, ...; disk n holds blocks n, 2n, 3n, .... Ex 13.5: We request 1024 blocks with n = 4: 6.46 ms (one average seek) + 8.33 ms (time per rotation) * (16/4) (number of rotations) = 39.8 ms.
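A minimal sketch of the striping idea and of the Ex 13.5 arithmetic (the block-to-disk mapping function is an illustration; the timing figures are the ones quoted above):

# Striping: block i of the relation lives on disk i mod n.
def disk_for_block(block_no, n_disks):
    return block_no % n_disks

# Ex 13.5 arithmetic: 4 disks read in parallel, so each disk delivers
# 1024/4 = 256 blocks, i.e. 16/4 = 4 rotations' worth per cylinder.
seek_ms, rotation_ms = 6.46, 8.33
print(seek_ms + rotation_ms * (16 / 4))   # ~39.8 ms, matching the slide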
Slide 21
13.3.4 Mirroring Disks Mirroring disks: having two or more disks hold identical copies of the data. Benefit 1: if n disks are mirrors of each other, the system can survive crashes of up to n-1 disks. Benefit 2: with n disks, read performance increases by a factor of n. Increased performance means increased efficiency.
Slide 22
13.3.5 Disk Scheduling and the Elevator Problem The disk controller runs this algorithm to select which of several pending requests to process first. Pseudocode:

requests[]  // array of all unprocessed data requests
upon receiving a new data request:
    requests[].add(new request)
while requests[] is not empty:
    move head to next location
    if head is at data in requests[]:
        retrieve the data
        remove the request from requests[]
    if head reaches the end:
        reverse head direction
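A minimal runnable sketch of the sweep itself (an illustration, assuming requests are plain cylinder numbers; unlike the worked trace on the next slide, this static version ignores requests that arrive while the head is moving):

# Elevator (SCAN) scheduling: serve requests in one direction, then reverse.
def elevator(head, requests):
    pending = sorted(requests)
    up = [c for c in pending if c >= head]         # served on the upward sweep
    down = [c for c in pending if c < head][::-1]  # served on the way back
    return up + down

print(elevator(32000, [8000, 24000, 56000, 16000, 64000, 40000]))
# -> [40000, 56000, 64000, 24000, 16000, 8000]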
Slide 23
13.3.5 Disk Scheduling and the Elevator Problem (cont) Events (cylinders range over 8000..64000; requests arrive while the head moves):
Current time 0: head at starting point; requests pending at 8000, 24000, 56000
Current time 4.3: get data at 8000
Current time 10: request arrives for 16000
Current time 13.6: get data at 24000
Current time 20: request arrives for 64000
Current time 26.9: get data at 56000
Current time 30: request arrives for 40000
Current time 34.2: get data at 64000 (head reverses)
Current time 45.5: get data at 40000
Current time 56.8: get data at 16000
Data | time serviced: 8000 | 4.3; 24000 | 13.6; 56000 | 26.9; 64000 | 34.2; 40000 | 45.5; 16000 | 56.8
Slide 24
13.3.5 Disk Scheduling and the Elevator Problem (cont) Comparison of service completion times (data | time):
Elevator algorithm: 8000 | 4.3; 24000 | 13.6; 56000 | 26.9; 64000 | 34.2; 40000 | 45.5; 16000 | 56.8
FIFO algorithm: 8000 | 4.3; 24000 | 13.6; 56000 | 26.9; 16000 | 42.2; 64000 | 59.5; 40000 | 70.8
Slide 25
13.3.6 Prefetching and Large-Scale Buffering If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed. This reduces cost and time.
Slide 26
13.4 Disk Failures Intermittent error: a read or write is unsuccessful, e.g. we try to read a sector but the correct contents of that sector are not delivered to the disk controller. We check for a good or bad sector; to verify that a write is correct, a read is performed, and the read operation tells us whether the sector is good or bad. Checksums: each sector has some additional bits, called the checksum, which are set depending on the values of the data bits stored in that sector. With checksums, the probability of accepting a bad sector as good is low. Odd parity: if the data has an odd number of 1s, add a parity bit 1. Even parity: if the data has an even number of 1s, add a parity bit 0. Either way, the total number of 1s becomes even.
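A small sketch of the parity-bit computation just described (even parity: the appended bit makes the total number of 1s even):

def parity_bit(bits):
    return "1" if bits.count("1") % 2 == 1 else "0"

print("01101000" + parity_bit("01101000"))   # 011010001 (three 1s -> add 1)
print("11101110" + parity_bit("11101110"))   # 111011100 (six 1s  -> add 0)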
Slide 27
Example: 1. Sequence 01101000 -> odd number of 1s; parity bit 1 -> 011010001. 2. Sequence 11101110 -> even number of 1s; parity bit 0 -> 111011100. Stable-Storage Writing Policy: used to recover from the disk failure known as media decay, in which, if we overwrite a file, the new data is not read correctly. Sectors are paired; each pair, called X, has a left copy Xl and a right copy Xr. We check the parity bits of the left and right copies, substituting spare sectors for Xl and Xr, until a good value is returned.
Slide 28
The term used for these strategies is RAID, or Redundant Arrays of Independent Disks. Mirroring: the mirroring scheme is referred to as RAID level 1 protection against data loss. In this scheme we mirror each disk: one disk is called the data disk and the other the redundant disk. In this case the only way data can be lost is if there is a second disk crash while the first crash is being repaired. Parity blocks: the RAID level 4 scheme uses only one redundant disk, no matter how many data disks there are. In the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks: the jth bits of all the ith blocks, across the data disks and the redundant disk, must have an even number of 1s, and the redundant disk's bit is used to make this condition true.
Slide 29
Failures: if one of Xl and Xr fails, X can be read from the other, but if both fail, X is not readable; the probability of this is very small. Write failure (during a power outage): 1. while writing Xl, Xr remains good, and X can be read from Xr; 2. after writing Xl, we can read X from Xl, since Xr may or may not have the correct copy of X. Recovery from disk crashes: to reduce data loss from disk crashes, schemes involving redundancy, extending the idea of parity checks or duplicated sectors, can be applied.
Slide 30
Parity Block Writing When we write a new block of a data disk, we need to change the corresponding block of the redundant disk as well. One approach is to read all the disks, compute the modulo-2 sum, and write it to the redundant disk; this requires n-1 reads of data blocks, a write of the data block, and a write of the redundant disk's block, for a total of n+1 disk I/Os. RAID 4 is effective in preserving data unless there are two simultaneous disk crashes. Error-correcting-code theory, specifically Hamming codes, leads to RAID level 6, under which two simultaneous crashes are correctable. For example, with four data disks and three redundant disks: the bits of disk 5 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 3; the bits of disk 6 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 4; the bits of disk 7 are the modulo-2 sum of the corresponding bits of disks 1, 3, and 4. Coping with multiple disk crashes, reading/writing: we may read data from any data disk normally. To write a block of some data disk, we compute the modulo-2 sum of the new and old versions of that block; these bits are then added, in a modulo-2 sum, to the corresponding blocks of all those redundant disks that have a 1 in the row in which the written disk also has a 1.
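A minimal sketch of the modulo-2 (XOR) parity idea behind RAID 4: the redundant block is the bitwise XOR of the data blocks, so any one lost block is the XOR of the survivors (the two-byte blocks are made-up sample data):

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte            # modulo-2 sum, bit by bit
    return bytes(out)

data = [b"\xb1\x02", b"\x0f\xf0", b"\x55\xaa"]   # three data disks' blocks
parity = xor_blocks(data)                        # redundant disk's block
# Rebuild disk 1 after a crash from the surviving disks plus parity:
assert xor_blocks([data[1], data[2], parity]) == data[0]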
Slide 31
Whatever scheme we use for updating the disks, we need to read and write the redundant disk's block. If there are n data disks, then the number of writes to the redundant disk will be n times the average number of writes to any one data disk. However, we do not have to treat one disk as the redundant disk and the others as data disks; rather, we could treat each disk as the redundant disk for some of the blocks. This improvement is called RAID level 5.
Slide 32
13.5 Arranging Data on Disk Data elements are represented as records, which are stored in consecutive bytes in the same disk block. Basic layout technique for storing data: fixed-length records. Allocation criterion: data should start at a word boundary. A fixed-length record header contains: 1. a pointer to the record schema; 2. the length of the record; 3. timestamps indicating when the record was last modified or last read.
Slide 33
Example: CREATE TABLE employee( name CHAR(30) PRIMARY KEY, address VARCHAR(255), gender CHAR(1), birthdate DATE ); The record should start at a word boundary and contain a header and four fields: name, address, gender, and birthdate.
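A hedged sketch of one possible fixed-length layout for this record (the 12-byte header layout, the padded field sizes, and the sample values are all assumptions for illustration):

import struct

RECORD = struct.Struct(
    "<III"     # assumed header: schema pointer, record length, timestamp
    "32s"      # name      CHAR(30), padded 30 -> 32 for word alignment
    "256s"     # address   VARCHAR(255) stored fixed, padded to 256
    "4s"       # gender    CHAR(1), padded 1 -> 4
    "12s"      # birthdate DATE as 'YYYY-MM-DD', padded 10 -> 12
)
row = RECORD.pack(0, RECORD.size, 0,
                  b"Susan", b"123 Oak St.", b"F", b"1960-01-01")
print(RECORD.size, len(row))   # 316 316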
Slide 34
Packing Fixed-Length Records into Blocks Records are stored in the form of blocks on the disk, and they move into main memory when we need to update or access them. A block header is written first, followed by a series of records. The block header contains the following information:
Links to one or more blocks that are part of a network of blocks. Information about the role played by this block in such a network. Information about the relation the tuples in this block belong to.
Slide 35
A "directory" giving the offset of each record in the block.
Time stamp(s) to indicate time of the block's last modification
and/or access Along with the header we can pack as many record as
we can in one block as shown in the figure and remaining space will
be unused.
Slide 36
13.6 Representing Block and Record Addresses Addresses of a block and a record: in main memory, the address of the block is the virtual-memory address of its first byte, and the address of a record within the block is the virtual-memory address of the record's first byte. In secondary memory, a sequence of bytes describes the location of the block in the overall system: the device ID for the disk, the cylinder number, etc.
Slide 37
Addresses in Client-Server Systems The addresses in an address space are represented in two ways. Physical addresses: byte strings that determine the place within the secondary storage system where the record can be found. Logical addresses: arbitrary strings of bytes of some fixed length. Physical-address bits are used to indicate: the host to which the storage is attached, the identifier for the disk, the number of the cylinder, the number of the track, and the offset of the beginning of the record.
Logical and Structured Addresses What is the purpose of a logical address? It gives more flexibility: we can move the record around within the block, or move the record to another block, and it gives us options for deciding what to do when a record is deleted. Pointer Swizzling Having pointers is common in object-relational database systems, so it is important to learn about the management of pointers. Every data item (block, record, etc.) has two addresses: its database address (the address on disk) and its memory address, if the item is in virtual memory.
Slide 40
Translation Table: maps database addresses to memory addresses. All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table. Each entry pairs a database address with a memory address.
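A minimal sketch of such a translation table (the addresses here are hypothetical opaque strings and integers):

class TranslationTable:
    def __init__(self):
        self.db_to_mem = {}          # database address -> memory address

    def load(self, db_addr, mem_addr):
        """Record that the item at db_addr now lives at mem_addr."""
        self.db_to_mem[db_addr] = mem_addr

    def swizzle(self, db_addr):
        """Memory address if the item is in memory, else None."""
        return self.db_to_mem.get(db_addr)

tt = TranslationTable()
tt.load("disk:block1", 0x7F00)
print(tt.swizzle("disk:block1"))   # 32512 -> pointer can be swizzled
print(tt.swizzle("disk:block2"))   # None  -> must stay a database address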
Slide 41
A pointer consists of the following two fields: a bit indicating the type of address (database or memory), and the address itself. Example 13.17 (figure): Block 1 on disk has been copied to memory; a pointer into Block 1 can be swizzled to point at the memory copy, while a pointer into Block 2, still on disk, remains unswizzled.
Slide 42
Example 13.17: Block 1 has a record with pointers to a second record on the same block and to a record on another block. If Block 1 is copied to memory, the first pointer, which points within Block 1, can be swizzled so it points directly to the memory address of the target record. Since Block 2 is not in memory, we cannot swizzle the second pointer. Three types of swizzling: Automatic swizzling: as soon as a block is brought into memory, swizzle all relevant pointers.
Slide 43
Swizzling on demand: only swizzle a pointer if and when it is actually followed. No swizzling: pointers are never swizzled; they are accessed using the database address. Unswizzling: when a block is moved from memory back to disk, all pointers must go back to database (disk) addresses. The translation table is used again, so it is important to have an efficient data structure for it.
Slide 44
Pinned Records and Blocks A block in memory is said to be pinned if it cannot safely be written back to disk. If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned. To unpin a block, we must unswizzle any pointers to it: keep in the translation table the places in memory holding swizzled pointers to that item, and unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
Slide 45
13.7 Records With Variable-Length Fields A simple but effective scheme is to put all fixed-length fields ahead of the variable-length fields. We then place in the record header: 1. the length of the record; 2. pointers to (i.e., offsets of) the beginnings of all the variable-length fields. However, if the variable-length fields always appear in the same order, then the first of them needs no pointer; we know it immediately follows the fixed-length fields.
Slide 46
Records With Repeating Fields A similar situation occurs if a record contains a variable number of occurrences of a field F, but the field itself is of fixed length. It is sufficient to group all occurrences of field F together and put in the record header a pointer to the first. We can locate all the occurrences of field F as follows: let the number of bytes devoted to one instance of field F be L. We then add to the offset of field F all integer multiples of L, starting at 0, then L, 2L, 3L, and so on, until we reach the offset of the field following F, whereupon we stop.
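The offset arithmetic just described, as a tiny sketch (the header offset, instance length L, and end offset are assumed sample values):

def field_offsets(offset, L, end):
    # offsets of each instance of F: offset, offset+L, offset+2L, ...
    return list(range(offset, end, L))

print(field_offsets(40, 8, 64))   # [40, 48, 56] -> three occurrences of F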
Slide 47
An alternative representation is to keep the record of fixed length and put the variable-length portion, be it fields of variable length or fields that repeat an indefinite number of times, on a separate block. In the record itself we keep: 1. pointers to the place where each repeating field begins, and 2. either how many repetitions there are, or where the repetitions end.
Slide 48
Variable-Format Records The simplest representation of variable-format records is a sequence of tagged fields, each of which consists of: 1. information about the role of this field, such as (a) the attribute or field name, (b) the type of the field, if it is not apparent from the field name and readily available schema information, and (c) the length of the field, if it is not apparent from the type; 2. the value of the field. There are at least two reasons why tagged fields make sense.
Slide 49
1. Information-integration applications: sometimes a relation has been constructed from several earlier sources, and these sources have different kinds of information. For instance, our movie-star information may have come from several sources, one of which records birthdates, some give addresses, others do not, and so on. If there are not too many fields, we are probably best off leaving NULL those values we do not know. 2. Records with a very flexible schema: if many fields of a record can repeat and/or not appear at all, then even if we know the schema, tagged fields may be useful. For instance, medical records may contain information about many tests, but there are thousands of possible tests, and each patient has results for relatively few of them.
Slide 50
Performance of Multipass, Sort-Based Algorithms
BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) <= M. Put another way, s(M, 1) = M.
INDUCTION: Suppose k > 1. Then we partition R into M pieces, each of which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/M, which is the size of each of the M pieces of R, cannot exceed s(M, k - 1). That is: s(M, k) = M * s(M, k - 1). Together with the basis, this gives s(M, k) = M^k.
Slide 178
Multipass Hash-Based Algorithms BASIS: For a unary operation, if the relation fits in M buffers, read it into memory and perform the operation. For a binary operation, if either relation fits in M - 1 buffers, perform the operation by reading this relation into main memory and then reading the second relation, one block at a time, into the Mth buffer. INDUCTION: If no relation fits in main memory, then hash each relation into M - 1 buckets, as discussed in Section 15.5.1. Recursively perform the operation on each bucket or corresponding pair of buckets, and accumulate the output from each bucket or pair.
Slide 179
The Query Compiler 16.1 Parsing and Preprocessing Meghna
Jain(205) Dr. T. Y. Lin
Slide 180
Presentation Outline 16.1 Parsing and Preprocessing 16.1.1
Syntax Analysis and Parse Tree 16.1.2 A Grammar for Simple Subset
of SQL 16.1.3 The Preprocessor 16.1.4 Processing Queries Involving
Views
Slide 181
Query compilation is divided into three steps: 1. Parsing: parse the SQL query into a parse tree. 2. Logical query plan: transform the parse tree into an expression tree of relational algebra. 3. Physical query plan: transform the logical query plan into a physical query plan, which specifies the operations performed, the order of operations, the algorithm used for each operation, and the way stored data is obtained and passed from one operation to another.
Slide 182
From a query to a logical query plan: Query -> Parser -> Preprocessor -> Logical query plan generator -> Query rewriter -> Preferred logical query plan
Slide 183
Syntax Analysis and Parse Trees The parser takes the SQL query and converts it to a parse tree. Nodes of a parse tree are: 1. Atoms: lexical elements such as keywords, constants, parentheses, operators such as + and <, and other schema elements. 2. Syntactic categories: names for families of query subparts that all play a similar role in a query.
Slide 184
Grammar for a Simple Subset of SQL The syntactic category <Query> is intended to represent all well-formed queries of SQL. Some of its rules are:
<Query> ::= <SFW>
<Query> ::= ( <Query> )
Select-From-Where forms give the syntactic category <SFW>:
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
Select lists: <SelList> ::= <Attribute>, <SelList>
              <SelList> ::= <Attribute>
From lists: <FromList> ::= <Relation>, <FromList>
            <FromList> ::= <Relation>
Slide 185
Conditions:
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
In these rules, atoms play the role of constants, syntactic categories play the role of variables, and ::= means "can be expressed/defined as".
Slide 186
Query and Parse Tree StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate) Query: Give titles of
movies that have at least one star born in 1960 SELECT title FROM
StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%' );
Slide 187
Slide 188
Another query equivalent SELECT title FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE '%1960%' ;
Slide 189
Figure: parse tree for the equivalent query, with leaves SELECT, title, FROM, StarsIn, MovieStar, WHERE, and the condition starName = name AND birthdate LIKE '%1960%'
Slide 190
The Preprocessor Functions of the preprocessor: if a relation used in the query is a virtual view, then each use of this relation in the from-list must be replaced by a parse tree that describes the view. The preprocessor is also responsible for semantic checking: 1. Checks relation uses: every relation mentioned in the FROM clause must be a relation or view in the current schema. 2. Checks and resolves attribute uses: every attribute mentioned in the SELECT or WHERE clause must be an attribute of some relation in the current scope; for instance, attribute title in the first select-list. 3. Checks types: all attributes must be of a type appropriate to their uses. Since birthdate is a date, and dates in SQL can normally be treated as strings, this use of the attribute is validated. Likewise, operators are checked to see that they apply to values of appropriate and compatible types.
Slide 191
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate) Query: Give titles of
movies that have at least one star born in 1960 SELECT title FROM
StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%' );
Slide 192
Preprocessing Queries Involving Views When an operand in a query is a virtual view, the preprocessor needs to replace the operand by a piece of parse tree that represents how the view is constructed from base tables. Base table: Movies(title, year, length, genre, studioName, producerC#). View definition: CREATE VIEW ParamountMovies AS SELECT title, year FROM Movies WHERE studioName = 'Paramount'; Example query based on the view: SELECT title FROM ParamountMovies WHERE year = 1979;
Slide 193
Thank You
Slide 194
Query Compiler By: Payal Gupta Roll No: 106(225) Professor:
Tsau Young Lin
Slide 195
Pushing Selections That is, replacing the left side of one of the rules by its right side. In pushing selections we first push a selection as far up the tree as it will go, and then push the selections down all possible branches.
Slide 196
Let's take an example: StarsIn(title, year, starName) Movie(title, year, length, inColor, studioName, producerC#) Define the view MoviesOf1996 by: CREATE VIEW MoviesOf1996 AS SELECT * FROM Movie WHERE year = 1996;
Slide 197
"which stars worked for which studios in 1996? can be given by
a SQL Query: SELECT starName, studioName FROM MoviesOfl996 NATURAL
JOIN StarsIn;
Slide 198
Figure: logical query plan constructed from the definitions of the query and view: π_starName,studioName(σ_year=1996(Movie) ⋈ StarsIn)
Slide 199
Figure: improving the query plan by moving selections up and down the tree: π_starName,studioName(σ_year=1996(Movie) ⋈ σ_year=1996(StarsIn))
Slide 200
"pushing" projections really involves introducing a new
projection somewhere below an existing projection. projection keeps
the number of tuples the same and only reduces the length of
tuples. To describe the transformations of extended projection
Consider a term E + x on the list for a projection, where E is an
attribute or an expression involving attributes and constants and x
is an output attribute. Laws Involving Projection
Slide 201
Example: Let R(a, b, c) and S(c, d, e) be two relations. Consider the expression π_{a+e->x, b->y}(R ⋈ S). The input attributes of the projection are a, b, and e, and c is the only join attribute. We may apply the law for pushing projections below joins to get the equivalent expression: π_{a+e->x, b->y}(π_{a,b,c}(R) ⋈ π_{c,e}(S))
Slide 202
Eliminating the projection onto a, b, c (it keeps all the attributes of R) gives a third equivalent expression: π_{a+e->x, b->y}(R ⋈ π_{c,e}(S)). In addition, we can perform a projection entirely before a bag union. That is: π_L(R ∪_B S) = π_L(R) ∪_B π_L(S)
Slide 203
Laws About Joins and Products Laws that follow directly from the definition of the join: R ⋈_C S = σ_C(R × S), and R ⋈ S = π_L(σ_C(R × S)), where C is the condition that equates each pair of attributes from R and S with the same name, and L is a list that includes one attribute from each equated pair and all the other attributes of R and S. We identify a product followed by a selection as a join of some kind.
Slide 204
Laws Involving Duplicate Elimination The operator δ, which eliminates duplicates from a bag, can be pushed through many, but not all, operators. In general, moving a δ down the tree reduces the size of intermediate relations and may therefore be beneficial. Moreover, we can sometimes move the δ to a position where it can be eliminated altogether, because it is applied to a relation that is known not to possess duplicates.
Slide 205
δ(R) = R if R has no duplicates. Important cases of such a relation R include: a) a stored relation with a declared primary key, and b) a relation that is the result of a γ (grouping) operation, since grouping creates a relation with no duplicates.
Slide 206
Several laws that "push" through other operators are: (R*S)
=(R) * (S) (R S)=(R) (S) (R c S)=(R) c (S) ( c (R))= c ((R)) We can
also move the to either or both of the arguments of an
intersection: (R B S) = (R) B S = R B (S) = (R) B (S) OO
Slide 207
Laws Involving Grouping and Aggregation When we consider the operator γ, we find that the applicability of many transformations depends on the details of the aggregate operators used; thus we cannot state laws in the generality that we used for the other operators. One exception is that a δ absorbs a γ. Precisely: δ(γ_L(R)) = γ_L(R)
Slide 208
Let us call an operator γ_L duplicate-impervious if the only aggregations in L are MIN and/or MAX. Then γ_L(R) = γ_L(δ(R)), provided L is duplicate-impervious.
Slide 209
Example: Suppose we have the relations MovieStar(name, addr, gender, birthdate) StarsIn(movieTitle, movieYear, starName) and we want to know, for each year, the birthdate of the youngest star to appear in a movie that year. We can express this query as: SELECT movieYear, MAX(birthdate) FROM MovieStar, StarsIn WHERE name = starName GROUP BY movieYear;
Slide 210
Figure: initial logical query plan for the query: γ_{movieYear, MAX(birthdate)}(σ_{name=starName}(MovieStar × StarsIn))
Slide 211
Some transformations that we can apply to the plan are: 1. Combine the selection and product into an equijoin. 2. Generate a δ below the γ, since the γ is duplicate-impervious. 3. Generate a π between the γ and the δ to project onto movieYear and birthdate, the only attributes relevant to the γ.
Slide 212
Figure: another query plan for the query: γ_{movieYear, MAX(birthdate)} over π_{movieYear, birthdate} over (MovieStar ⋈_{name=starName} StarsIn)
Slide 213
Figure: a third query plan for the example, pushing projections below the join: γ_{movieYear, MAX(birthdate)} over π_{movieYear, birthdate} over (π_{birthdate,name}(MovieStar) ⋈_{name=starName} π_{movieYear,starName}(StarsIn))
Slide 214
The Query Compiler, Section 16.3, DATABASE SYSTEMS: The Complete Book. Presented by: Deepti Kundu, under the supervision of Dr. T. Y. Lin
Slide 215
Review: Query -> Parser -> Preprocessor (Section 16.1) -> Logical query plan generator -> Query rewriter (Section 16.3) -> Preferred logical query plan
Slide 216
Two steps to turn Parse tree into Preferred Logical Query Plan
Replace the nodes and structures of the parse tree, in appropriate
groups, by an operator or operators of relational algebra. Take the
relational algebra expression and turn it into an expression that
we expect can be converted to the most efficient physical query
plan.
Slide 217
Reference relations: StarsIn(movieTitle, movieYear, starName) MovieStar(name, address, gender, birthdate) Conversion to Relational Algebra: if we have a <Query> with a <Condition> that has no subqueries, then we may replace the entire construct (the select-list, from-list, and condition) by a relational-algebra expression.
Slide 218
The relational-algebra expression consists of the following, from bottom to top: the product of all the relations mentioned in the <FromList>, which is the argument of a selection σ_C, where C is the <Condition> in the construct being replaced, which in turn is the argument of a projection π_L, where L is the list of attributes in the <SelList>. Example: SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%';
Slide 219
SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%';
Slide 220
Translation to an algebraic expression tree
Slide 221
Removing Subqueries From Conditions For parse trees with a <Condition> that has a subquery, we use an intermediate operator: a two-argument selection. It is intermediate between the syntactic categories of the parse tree and the relational-algebra operators that apply to relations.
Slide 222
Figure: using a two-argument σ: π_movieTitle over a two-argument selection whose left argument is StarsIn and whose right argument is the condition starName IN (the subquery over MovieStar with birthdate LIKE '%1960%')
Slide 223
Two-argument selection with a condition involving IN: we have two arguments, some relation R, and a <Condition> of the form t IN S, where t is a tuple composed of some attributes of R, and S is an uncorrelated subquery. Steps to be followed: 1. Replace the <Condition> by the tree that is the expression for S (a δ is used to remove duplicates). 2. Replace the two-argument selection by a one-argument selection σ_C, where C equates the components of t to the corresponding attributes of S. 3. Give σ_C an argument that is the product of R and S.
Slide 224
Figure: the two-argument selection over R with condition t IN S becomes σ_C over the product R × δ(S)
Slide 225
The effect
Slide 226
Improving the Logical Query Plan Algebraic laws to improve
logical query plans: Selections can be pushed down the expression
tree as far as they can go. Similarly, projections can be pushed
down the tree, or new projections can be added. Duplicate
eliminations can sometimes be removed, or moved to a more
convenient position in the tree. Certain selections can be combined
with a product below to turn the pair of operations into an
equijoin.
Slide 227
Grouping Associative/Commutative Operators An operator that is associative and commutative may be thought of as having any number of operands. We need to reorder these operands so that the multiway join is executed as a sequence of binary joins; it may be more time-consuming to execute them in the order suggested by the parse tree. For each portion of the subtree that consists of nodes with the same associative and commutative operator (natural join, union, or intersection), we group the nodes with these operators into a single node with many children.
Slide 228
Figure: the effect of query rewriting: π_movieTitle(σ_{starName=name}(StarsIn × σ_{birthdate LIKE '%1960%'}(MovieStar)))
Slide 229
Figure: the final step in producing the logical query plan: cascaded binary unions over R, S, T, U, V, W are grouped into a single multiway union node with children R, S, T, U, V, W
Slide 230
An example to summarize: find movies where the average age of the stars was at most 40 when the movie was made. SELECT DISTINCT m1.movieTitle, m1.movieYear FROM StarsIn m1 WHERE m1.movieYear ... 40
Notation for Physical Query Plans (cont.) Example of a physical query plan: the plan in Example 16.36 for the case k > 5000 uses a TableScan, a two-pass hash join, and materialization (shown with a double line) via a Store operator.
Slide 309
Notation for Physical Query Plans (cont.) Another example: the plan in Example 16.36 for the case k < 49 uses two TableScans, a two-pass hash join, pipelining (with different buffer needs), and a Store operator.
Slide 310
Notation for Physical Query Plans (cont.) The physical query plan in Example 16.35 uses an index on the condition y = 2 first, and filters with the rest of the condition later on.
Slide 311
VII. Ordering of Physical Operations The physical query plan is represented as a tree structure with an implied order of operations. Still, the order of evaluation of interior nodes may not always be clear: iterators are used in a pipelined manner, and since the execution times of various nodes overlap, a strict ordering may not make sense.
Slide 312
Ordering of Physical Operations (cont.) Three rules summarize the ordering of events in a physical-query-plan tree: 1. Break the tree into subtrees at each edge that represents materialization; execute one subtree at a time. 2. Order the execution of the subtrees bottom-up and left-to-right. 3. All nodes of each subtree are executed simultaneously.
Slide 313
Reference [1] H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: The Complete Book, second edition, pp. 897-913, Prentice Hall, New Jersey, 2008.
Slide 314
Chapter 18: 18.1 Serial and Serializable Schedules The process of ensuring that transactions preserve consistency when executing simultaneously is called concurrency control; this consistency is taken care of by the scheduler. Concurrency control in database management systems (DBMS) ensures that database transactions are performed concurrently without violating the data integrity of the database. Executed transactions should follow the ACID rules, and the DBMS must guarantee that only serializable (unless serializability is intentionally relaxed), recoverable schedules are generated.
Slide 315
It also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the database. ACID rules: Atomicity: either the effects of all or none of a transaction's operations remain when it completes; to the outside world the transaction appears indivisible, atomic. Consistency: every transaction must leave the database in a consistent state. Isolation: transactions cannot interfere with each other; providing isolation is the main goal of concurrency control. Durability: the effects of successful transactions must persist through crashes.
Slide 316
In the field of databases, a schedule is a list of actions (i.e. reading, writing, aborting, committing) from a set of transactions. In this example, Schedule D involves three transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS: T1 reads and writes object X, then T2 reads and writes object Y, and finally T3 reads and writes object Z. This is an example of a serial schedule, because the actions of the three transactions are not interleaved.
Slide 317
Serial and Serializable Schedules: A schedule that is
equivalent to a serial schedule has the serializability property.
In schedule E, the order in which the actions of the transactions
are executed is not the same as in D, but in the end, E gives the
same result as D.
Slide 318
Serial Schedule: T1 precedes T2 (A = B = 50 initially)
T1: READ(A,t); t := t+100; WRITE(A,t)   [A = 150]
T2: READ(A,s); s := s*2;   WRITE(A,s)   [A = 300]
T1: READ(B,t); t := t+100; WRITE(B,t)   [B = 150]
T2: READ(B,s); s := s*2;   WRITE(B,s)   [B = 300]
Slide 319
Non-Serializable Schedule (A = B = 50 initially)
T1: READ(A,t); t := t+100; WRITE(A,t)   [A = 150]
T2: READ(A,s); s := s*2;   WRITE(A,s)   [A = 300]
T2: READ(B,s); s := s*2;   WRITE(B,s)   [B = 100]
T1: READ(B,t); t := t+100; WRITE(B,t)   [B = 200]
Slide 320
A Serializable Schedule with details (A = B = 50 initially)
T1: READ(A,t); t := t+100; WRITE(A,t)   [A = 150]
T2: READ(A,s); s := s*1;   WRITE(A,s)   [A = 150]
T2: READ(B,s); s := s*1;   WRITE(B,s)   [B = 50]
T1: READ(B,t); t := t+100; WRITE(B,t)   [B = 150]
Slide 321
18.2 Conflict Serializability Non-conflicting actions: two actions are non-conflicting if, whenever they occur consecutively in a schedule, swapping them does not affect the final state produced by the schedule. Otherwise, they are conflicting. Conflicting actions, general rules: two actions of the same transaction conflict, e.g. r1(A) w1(B); two actions over the same database element conflict if at least one of them is a write, e.g. r1(A) w2(A), or w1(A) w2(A).
Slide 322
Conflict Serializable: We may take any schedule and make as many non-conflicting swaps as we wish, with the goal of turning the schedule into a serial schedule. If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the non-conflicting swaps. A schedule is said to be conflict-serializable when it is conflict-equivalent to one or more serial schedules. An equivalent definition: a schedule is conflict-serializable if and only if there exists an acyclic precedence (serializability) graph for the schedule.
Slide 323
Conflict equivalence / conflict-serializability: Let Ai and Aj be consecutive non-conflicting actions that belong to different transactions; we can swap Ai and Aj without changing the result. Two schedules are conflict-equivalent if they can be turned one into the other by a sequence of non-conflicting swaps of adjacent actions. We call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule. Test for conflict-serializability: construct the precedence graph for S and check for cycles. If there is a cycle, then S is not conflict-serializable; otherwise, it is a conflict-serializable schedule.
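A runnable sketch of this test (the schedule encoding as (transaction, action, element) triples is an assumption; conflicts are pairs on the same element, from different transactions, with at least one write):

def is_conflict_serializable(schedule):
    edges = {}   # precedence graph: Ti -> set of Tj
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "w" in (ai, aj):
                edges.setdefault(ti, set()).add(tj)

    color = {t: 0 for t, _, _ in schedule}   # 0 white, 1 gray, 2 black
    def has_cycle(t):
        color[t] = 1
        for u in edges.get(t, ()):
            if color[u] == 1 or (color[u] == 0 and has_cycle(u)):
                return True
        color[t] = 2
        return False
    return not any(color[t] == 0 and has_cycle(t) for t, _, _ in schedule)

# S1 from the example below: r2(A) r1(B) w2(A) r2(B) r3(A) w1(B) w3(A) w2(B)
S1 = [("T2","r","A"), ("T1","r","B"), ("T2","w","A"), ("T2","r","B"),
      ("T3","r","A"), ("T1","w","B"), ("T3","w","A"), ("T2","w","B")]
print(is_conflict_serializable(S1))   # False: the precedence graph is cyclic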
Slide 324
Example of a cyclic precedence graph: consider the schedule S1: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B). Observing the actions on A in the previous example (figure 2), we can find that T2
Expected exceptions: 1. Suppose there is a transaction U such that: U is in VAL or FIN, that is, U has validated; FIN(U) > START(T), that is, U did not finish before T started; and RS(T) ∩ WS(U) ≠ ∅; let the intersection contain database element X. 2. Suppose there is a transaction U such that: U is in VAL, i.e., U has successfully validated; FIN(U) > VAL(T), i.e., U did not finish before T entered its validation phase; and WS(T) ∩ WS(U) ≠ ∅; let X be in both write sets.
Slide 367
Validation rules: Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., FIN(U) > START(T). Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., FIN(U) > VAL(T).
Slide 368
Solution: Validation of U: nothing to check. Validation of T: WS(U) ∩ RS(T) = {D} ∩ {A,B} = ∅; WS(U) ∩ WS(T) = {D} ∩ {A,C} = ∅. Validation of V: RS(V) ∩ WS(T) = {B} ∩ {A,C} = ∅; WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = ∅; RS(V) ∩ WS(U) = {B} ∩ {D} = ∅. Validation of W: RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}; RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}; WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = ∅ (W is not validated).
Template for Query Patterns To design a wrapper, build templates for all possible queries that the mediator can ask. Mediator schema: AutosMed(serialNo, model, color, autoTrans, dealer). Source schema: Cars(serialNo, model, color, autoTrans, navi, ...). Mediator-to-wrapper template for cars of a given color ($c): SELECT * FROM AutosMed WHERE color = '$c'; => SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE color = '$c'; This wrapper template describes queries for cars of a given color. Templates needed: 2^n for n attributes, to cover all possible queries from the mediator.
Slide 408
Wrapper Generators The software that creates the wrapper is a wrapper generator. Figure: the wrapper generator takes the templates and produces a table used by the driver; queries flow from the mediator through the driver to the source, and results flow back through the wrapper.
Slide 409
Wrapper Generators The wrapper generator creates a table that holds the various query patterns contained in the templates, and the source queries associated with each of them. The driver: accepts a query from the mediator; searches the table for a template that matches the query; sends the query to the source; and returns the response to the mediator.
Slide 410
Filters Consider the car dealer's database. The wrapper template to get the cars of a given model and color is: SELECT * FROM AutosMed WHERE model = '$m' AND color = '$c'; => SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE model = '$m' AND color = '$c'; Another approach is to have a wrapper filter: the wrapper has a template that returns a superset of what the query wants, then filters the returned tuples and passes only the desired ones. The filter component can be positioned at the wrapper or at the mediator.
Slide 411
Filters To find the blue cars of model Ford: use the template to extract the blue cars, return the tuples to the mediator, and filter for the Ford models at the mediator. Store the result in the temporary relation TempAutos(serialNo, model, color, autoTrans, dealer), then filter by executing a local query: SELECT * FROM TempAutos WHERE model = 'Ford';
Slide 412
Other Operations at the Wrapper It is possible to take joins at the wrapper and transmit the result to the mediator. Suppose the mediator is asked to find dealers and models such that the dealer has two red cars of the same model, one with and one without automatic transmission: SELECT A1.model, A1.dealer FROM AutosMed A1, AutosMed A2 WHERE A1.model = A2.model AND A1.color = 'red' AND A2.color = 'red' AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes'; The wrapper can first obtain all the red cars: SELECT * FROM AutosMed WHERE color = 'red'; storing them in RedAutos(serialNo, model, color, autoTrans, dealer)
Slide 413
Other Operations at the Wrapper The wrapper then performs a join and the necessary selection: SELECT DISTINCT A1.model, A1.dealer FROM RedAutos A1, RedAutos A2 WHERE A1.model = A2.model AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes';
Slide 414
Thank You
Slide 415
Sections 21.4-21.5, Sanuja Dabade & Eilbroun Benjamin, CS 257,
Dr. T. Y. Lin. INFORMATION INTEGRATION
Slide 416
21.4 Capability-Based Optimization Introduction A typical DBMS estimates the cost of each query plan and picks what it believes to be the best. A mediator, however, has little knowledge of how long its sources will take to answer, so optimization of mediator queries cannot rely on cost measures alone to select a query plan. Optimization by the mediator therefore follows capability-based optimization.
Slide 417
21.4.1 The Problem of Limited Source Capabilities Many sources have only Web-based interfaces, which usually allow querying only through a query form. E.g. the Amazon.com interface allows us to ask about books in many different ways, but general queries like SELECT * FROM books cannot be asked.
Slide 418
21.4.1 The Problem of Limited Source Capabilities (cont) Reasons why a source may limit the ways in which queries can be asked: the earliest databases did not use relational DBMSs that support SQL queries; indexes on a large database may make certain queries feasible while others are too expensive to execute; security reasons, e.g. a CS department may answer queries about average salary but won't disclose an individual professor's salary.
Slide 419
21.4.2 A Notation for Describing Source Capabilities For relational data, the legal forms of queries are described by adornments: sequences of codes that represent the requirements for the attributes of the relation, in their standard order. f (free): the attribute can be specified or not. b (bound): a value must be specified for the attribute, but any value is allowed. u (unspecified): a value may not be specified for the attribute.
Slide 420
21.4.2 A Notation for Describing Source Capabilities (cont) c[S] (choice from set S) means that a value must be specified, and the value must come from the finite set S. o[S] (optional from set S) means we either do not specify a value, or we specify a value from the finite set S. A prime (e.g. f') specifies that an attribute is not part of the output of the query. The specification of capabilities is a set of adornments as defined above; a query must match one of the adornments in its capabilities specification.
Slide 421
21.4.2 A notation for Describing Source Capabilities.(contd)
E.g. Dealer 1 is a source of data in the form: Cars (serialNo,
model, color, autoTrans, navi) The adornment for this query form is
buuuu
Slide 422
21.4.3 Capability-Based Query-Plan Selection Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask of the sources to help answer the query. The process is repeated until either: enough queries have been asked of the sources to resolve all the conditions of the mediator query, so the query is answered (such a plan is called feasible); or no more valid forms of source queries can be constructed, yet the mediator query still cannot be answered.
Slide 423
21.4.3 Capability-Based Query-Plan Selection (cont) The simplest form of mediator query for which we need the above strategy is a join of relations. E.g. we have two sources for dealer 2: Autos(serial, model, color) and Options(serial, option). Suppose that ubf is the sole adornment for Autos, while Options has two adornments, bu and uc[autoTrans, navi]. The query is to find the serial numbers and colors of McLaren models with a navigation system.
Slide 424
21.4.4 Adding Cost-Based Optimization The mediator's query optimizer is not done when the capabilities of the sources have been examined: it must then choose among the available feasible plans. Making an intelligent, cost-based query optimization requires that the mediator know a great deal about the costs of the queries involved. Since the sources are independent of the mediator, it is difficult to estimate the cost.
Slide 425
21.5 Optimizing Mediator Queries The chain algorithm is a greedy algorithm that answers the query by sending a sequence of requests to its sources. It always finds a solution if one exists, though the solution may not be optimal.
Slide 426
21.5.1 Simplified Adornment Notation A query at the mediator is limited to b (bound) and f (free) adornments. We use the following convention for describing adornments: name^adornments(attributes), where name is the name of the relation and the number of adornments equals the number of attributes.
Slide 427
21.5.2 Obtaining Answers for Subgoals Rules for subgoals and sources: suppose we have the subgoal R^{x1 x2 ... xn}(a1, a2, ..., an), and the source adornment for R is y1 y2 ... yn. If yi is b or c[S], then xi must be b. If xi is f, then yi must not be output-restricted. The adornment on the subgoal matches the adornment at the source when, in each remaining position, yi is f, u, or o[S] and xi is either b or f.
Slide 428
21.5.3 The Chain Algorithm The algorithm maintains two types of information: an adornment for each subgoal, and a relation X that is the join of the relations for all the subgoals that have been resolved. Initially, the adornment for a subgoal is b exactly where the mediator query provides a constant binding for the corresponding argument, and X is a relation over no attributes, containing just the empty tuple.
Slide 429
21.5.3 The Chain Algorithm (cont) First, initialize the adornments of the subgoals and X. Then repeatedly select a subgoal that can be resolved. Let R(a1, a2, ..., an) be the subgoal: 1. Wherever its adornment has a b, we shall find that the argument of R is a constant or a variable in the schema of X. Project X onto its variables that appear in R.
Slide 430
21.5.3 The Chain Algorithm (cont) 2. For each tuple t in the projection of X, issue a query to the source as follows (let the source adornment be given): if a component of the source adornment is b, then the corresponding component of the subgoal's adornment is b, and we can use the corresponding component of t in the source query. If a component of the source adornment is c[S], and the corresponding component of t is in S, then the corresponding component of the subgoal's adornment is b, and we can use the corresponding component of t in the source query. If a component of the source adornment is f, and the corresponding component of the subgoal's adornment is b, provide a constant value for the source query.
Slide 431
21.5.3 The Chain Algorithm (cont) If a component of the source adornment is u, then provide no binding for this component in the source query. If a component of the source adornment is o[S], and the corresponding component of the subgoal's adornment is f, then treat it as if it were f. If a component of the source adornment is o[S], and the corresponding component of the subgoal's adornment is b, then treat it as if it were c[S]. 3. Every variable among a1, a2, ..., an is now bound.
Slide 432
21.5.3 The Chain Algorithm (cont) 4. Replace X with X ⋈ π_S(R), where S is all of the variables among a1, a2, ..., an. 5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal. If every subgoal is resolved, then X is the answer.
Slide 433
21.5.3 The Chain Algorithm Example Mediator query: Q: Answer(c) <- R^{bf}(1,a) AND S^{ff}(a,b) AND T^{ff}(b,c). Source data and adornments:
R (adornment bf), schema (w,x): tuples (1,2), (1,3), (1,4)
S (adornment c'[2,3,5]f), schema (x,y): tuples (2,4), (3,5)
T (adornment bu), schema (y,z): tuples (4,6), (5,7), (5,8)
Slide 434
21.5.3 The Chain Algorithm Example (cont) Initially, the adornments on the subgoals are the same as in Q, and X contains just the empty tuple. S and T cannot be resolved, since they have ff adornments while their sources require a bound first argument. R(1,a) can be resolved, because its adornment is matched by the source's adornment. Send R(w,x) with w = 1 to get the table shown on the previous page.
Slide 435
21.5.3 The Chain Algorithm Example (cont) Project the subgoal's relation onto its second component, since only the second component of R(1,a) is a variable. This is joined with X, so X now equals the relation {a: 2, 3, 4}. Change the adornment on S from ff to bf.
Slide 436
21.5.3 The Chain Algorithm Example (cont) Now we resolve S^{bf}(a,b): project X onto a, then search S for tuples whose first component equals a value of a in X, giving (a,b) = (2,4) and (3,5). Join this relation with X, and remove column a, since it appears neither in the head nor in any unresolved subgoal, leaving X = {b: 4, 5}.
Slide 437
21.5.3 The Chain Algorithm Example (cont) Now we resolve T^{bf}(b,c): the matching tuples are (b,c) = (4,6), (5,7), (5,8). Join this relation with X and project onto the c attribute to get the relation for the head. The solution is {(6), (7), (8)}.
Slide 438
21.5.4 Incorporating Union Views at the Mediator This implementation of the chain algorithm does not consider that several sources can contribute tuples to a relation. If specific sources have tuples to contribute that other sources may not have, this adds complexity. To resolve it, we can either consult all sources, or make best efforts to return all the answers.
Slide 439
21.5.4 Incorporating Union Views at the Mediator (cont) Consulting all sources: a subgoal is resolved only when each source for its relation has an adornment matched by the current adornment of the subgoal. This is less practical, because it makes queries harder to answer and impossible if any source is down. Best efforts: we need only one source with a matching adornment to resolve a subgoal. This requires modifying the chain algorithm to revisit each subgoal when that subgoal has new bound requirements.
Local-as-View Mediators In a LAV mediator, global predicates are not defined as views of the source data. Instead, for each source, expressions are defined over the global predicates that describe the tuples the source produces; the mediator answers queries by constructing the views as provided by the sources.
Slide 442
Motivation for LAV Mediators The relationship between the data provided by the mediator and the sources is more subtle. For example, consider the predicate Par(c, p), meaning that p is a parent of c, which represents the set of all child-parent facts that could ever exist; the sources will provide information about whatever child-parent facts they know.
Slide 443
Motivation (cont.) There can be sources that provide child-grandparent facts but no child-parent facts at all; such a source could never be used to answer the child-parent query under a GAV mediator. LAV mediators allow us to say that a certain source provides grandparent facts, and to discover how and when to use that source in a given query.
Slide 444
Terminology for LAV Mediation The queries at the mediator, and those describing the sources, are single Datalog rules; a single Datalog rule is called a conjunctive query. The global predicates of the LAV mediator are used as the subgoals of mediator queries. Conjunctive queries also define views: their heads each have a unique view predicate that is the name of a view, each view definition has a body consisting of global predicates and is associated with a particular source, and each view is constructed with an all-free adornment.
Slide 445
Example: Consider the global predicate Par(c, p), meaning that p is a parent of c. One source produces parent facts; its view is defined by the conjunctive query: V1(c, p) <- Par(c, p). Another source produces some grandparent facts; its conjunctive query is: V2(c, g) <- Par(c, p) AND Par(p, g)
Slide 446
Example (cont.): The query at the mediator asks for great-grandparent facts to be obtained from the sources: Q(w, z) <- Par(w, x) AND Par(x, y) AND Par(y, z). One solution uses the parent predicate (V1) directly, three times: Q(w, z) <- V1(w, x) AND V1(x, y) AND V1(y, z). Another solution uses V1 (parent facts) and V2 (grandparent facts): Q(w, z) <- V1(w, x) AND V2(x, z), or Q(w, z) <- V2(w, y) AND V1(y, z)
Slide 447
Expanding Solutions Consider a query Q and a solution S whose body's subgoals are views, where each view V is defined by a conjunctive query with V as the head. The body of V's conjunctive query can be substituted for a subgoal in S that uses the predicate V, yielding a body consisting of only global predicates.
Slide 448
Expansion Algorithm A solution S has a subgoal V(a1, a2, ..., an), where the ai's can be any variables or constants. The view V is of the form V(b1, b2, ..., bn) <- B, where B represents the entire body. V(a1, a2, ..., an) can be replaced in solution S by a version of body B that has all the subgoals of B, with variables possibly altered.
Slide 449
Expansion Algorithm (cont.) The rules for altering the variables of B are: 1. First identify the local variables of B: variables that appear in the body but not in the head. 2. If any local variable of B appears in S, replace it by a distinct new variable that appears nowhere in the rule for V or in S. 3. In the body B, replace each bi by ai, for i = 1, 2, ..., n.
Slide 450
Example: Consider the view definitions V1(c, p) <- Par(c, p) and V2(c, g) <- Par(c, p) AND Par(p, g). One of the proposed solutions S is Q(w, z) <- V1(w, x) AND V2(x, z). The first subgoal, with predicate V1, can be expanded to Par(w, x), as there are no local variables.
Slide 451
Example (cont.): The V2 subgoal has a local variable p, which does not appear in S and has not been used as a local variable in another substitution, so p can be left as it is. Only x and z are substituted for the variables c and g. The solution S now becomes: Q(w, z) <- Par(w, x) AND Par(x, p) AND Par(p, z)
Slide 452
Containment of Conjunctive Queries A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that: 1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E. 2. If P(x1, x2, ..., xn) is a subgoal of Q, then P(τ(x1), τ(x2), ..., τ(xn)) is a subgoal of E. We also add the rule that τ(c) = c for any constant c.
Slide 453
Example: Consider two conjunctive queries: Q1: H(x, y) <- A(x, z) AND B(z, y) and Q2: H(a, b) <- A(a, c) AND B(d, b) AND A(a, d). When we apply the substitution τ(x) = a, τ(y) = b, τ(z) = d, the head of Q1 becomes H(a, b), which is the head of Q2. So there is a containment mapping from Q1 to Q2.
Slide 454
Example (cont.): The first subgoal of Q1 becomes A(a, d), which is the third subgoal of Q2, and the second subgoal of Q1 becomes the second subgoal of Q2. There is also a containment mapping from Q2 to Q1, so the two conjunctive queries are equivalent.
Slide 455
Why the Containment-Mapping Test Works Suppose there is a containment mapping τ from Q1 to Q2. When Q2 is applied to the database, we look for substitutions for all the variables of Q2; the substitution for the head becomes a tuple t that is returned by Q2. If we compose τ and then this substitution, we get a mapping from the variables of Q1 to tuples of the database that produces the same tuple t for the head of Q1.
Slide 456
Finding Solutions to a Mediator Query There can be an infinite number of solutions built from the views, using any number of subgoals and variables. The LMSS theorem limits the search: if a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals. Also, if the conjunctive query that defines a view V has in its body a predicate P that does not appear in the body of the mediator query, then we need not consider any solution that uses V.
Slide 457
Example: Recall the query Q1: Q(w, z) <- Par(w, x) AND Par(x, y) AND Par(y, z). This query has three subgoals, so we do not have to look at solutions with more than three subgoals.
Slide 458
Why the LMSS Theorem Holds Suppose we have a query Q with n subgoals, and a solution S with more than n subgoals. The expansion E of S must be contained in the query Q, which means there is a containment mapping from Q to E. We remove from S all subgoals whose expansions were not targets of one of Q's subgoals under the containment mapping.
Slide 459
(cont.) This leaves a new conjunctive query S' with at most n subgoals. If E' is the expansion of S', then E' is contained in Q, and S' is contained in S, since there is an identity mapping. Thus S need not be among the solutions to query Q.
Slide 460
Information Integration Entity Resolution 21.7 Presented By:
Deepti Bhardwaj Roll No: 223_103
Slide 461
Contents 21.7 Entity Resolution 21.7.1 Deciding Whether Records
Represent a Common Entity 21.7.2 Merging Similar Records 21.7.3
Useful Properties of Similarity and Merge Functions 21.7.4 The
R-Swoosh Algorithm for ICAR Records 21.7.5 Other Approaches to
Entity Resolution
Slide 462
Introduction ENTITY RESOLUTION: Entity resolution is a problem
that arises in many information integration scenarios. It refers to
determining whether two records or tuples do or do not represent
the same person, organization, place or other entity.
Slide 463
Deciding Whether Records Represent a Common Entity Two records represent the same individual if the two records have similar values for each of the fields associated with those records. We cannot simply insist that the values of corresponding fields be identical, for reasons such as: 1. misspellings, 2. variant names, 3. misunderstanding of names.
Slide 464
Continue: Deciding Whether Records Represent a Common Entity 4. evolution of values, 5. abbreviations. Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.
Slide 465
Deciding Whether Records Represent a Common Entity - Edit Distance A first approach to measuring the similarity of records is edit distance. Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other. The records represent the same entity if their similarity measure is below a given threshold.
Slide 466
Deciding Whether Records Represent a Common Entity - Normalization We can normalize records by replacing certain substrings with others; for instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. Once the records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
Slide 467
Merging Similar Records Merging refers to the removal of redundant data in two records. There are many merge rules: 1. Set any field on which the records disagree to the empty string. 2. (i) Merge by taking the union of the values in each field. (ii) Declare two records similar if at least two of the three fields have a nonempty intersection.
Slide 468
Continue: Merging Similar Records
Name  | Address       | Phone
Susan | 123 Oak St.   | 818-555-1234
Susan | 456 Maple St. | 818-555-1234
Susan | 456 Maple St. | 213-555-5678
After merging (1-2-3):
Susan | {123 Oak St., 456 Maple St.} | {818-555-1234, 213-555-5678}
Slide 469
Useful Properties of Similarity and Merge Functions The following properties say that the merge operation is a semilattice: 1. Idempotence: the merge of a record with itself yields the same record. 2. Commutativity: the order in which we merge two records does not matter. 3. Associativity: the order in which we group records for a merger should not matter.
Slide 470
Continue: Useful Properties of Similarity and Merge Functions There are some other properties that we expect the similarity relationship to have: Idempotence for similarity: a record is always similar to itself. Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them. Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
Slide 471
R-Swoosh Algorithm for ICAR Records Input: a set of records I, a similarity function, and a merge function. Output: a set of merged records O. Method:
O := emptyset;
WHILE I is not empty DO BEGIN
    let r be any record in I;
    find, if possible, some record s in O that is similar to r;
    IF no such record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
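A runnable sketch of the same loop (the record representation as frozensets of (field, value) pairs, and the sample similarity and merge functions, are assumptions chosen to satisfy the ICAR properties):

def r_swoosh(I, similar, merge):
    O = []
    while I:
        r = I.pop()
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)
        else:
            O.remove(s)
            I.append(merge(r, s))
    return O

records = [frozenset({("name", "Susan"), ("phone", "818-555-1234")}),
           frozenset({("name", "Susan"), ("addr", "456 Maple St.")}),
           frozenset({("name", "Ann"), ("phone", "213-555-5678")})]
out = r_swoosh(records,
               similar=lambda r, s: bool(r & s),   # share any field value
               merge=lambda r, s: r | s)           # union of fields
print(len(out))   # 2: the two Susan records were merged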
Slide 472
Other Approaches to Entity Resolution The other approaches to entity resolution are: non-ICAR datasets, clustering, and partitioning.
Slide 473
Other Approaches to Entity Resolution - Non-ICAR Datasets For datasets that lack the ICAR properties, we can define a dominance relation r