CS 257: Database System Principles Final Exam by Ronak Shah (214) SJSU ID: 006260709 under guidance of Dr. Lin

  • Slide 1
  • CS 257: Database System Principles Final Exam by Ronak Shah (214) SJSU ID: 006260709 under guidance of Dr. Lin
  • Slide 2
  • Sections 13.1–13.3 Sanuja Dabade & Eilbroun Benjamin CS 257 Dr. TY Lin SECONDARY STORAGE MANAGEMENT
  • Slide 3
  • 13.1.1 Memory Hierarchy Data storage capacities vary across devices, and so does the cost per byte of storing data. The devices with the smallest capacity offer the fastest access speed, at the highest cost per bit.
  • Slide 4
  • Memory Hierarchy Diagram (figure): from fastest to slowest, cache, main memory, disk (file system), and tertiary storage; programs and main-memory DBMSs work on data in main memory, while virtual memory and DBMS file systems span main memory and disk.
  • Slide 5
  • 13.1.1 Memory Hierarchy Cache: the lowest level of the hierarchy. Data items in the cache are copies of certain main-memory locations. Sometimes values in the cache are changed, and the corresponding changes to main memory are delayed. The machine looks in the cache for instructions as well as for the data those instructions use. The amount of data that can be cached is limited. In a single-processor computer there is no need to update main memory immediately; in a multiprocessor, data is updated to main memory immediately, a policy called write-through.
  • Slide 6
  • Main Memory Refers to physical memory that is internal to the computer. The word main distinguishes it from external mass-storage devices such as disk drives. Everything that happens in the computer (instruction execution, data manipulation) works on information that is resident in main memory. Main memory is random access: any byte can be obtained in the same amount of time.
  • Slide 7
  • Secondary storage Used to store data and programs when they are not being processed. More permanent than main memory: data and programs are retained when the power is turned off. A PC might only require 20,000 bytes of secondary storage. Examples: magnetic disks, hard disks.
  • Slide 8
  • Tertiary Storage Consists of one to several storage drives. It is a comprehensive storage system that is usually very slow, so it is typically used to archive data that is not accessed frequently. Holds data volumes measured in terabytes. Used for databases much larger than what can be stored on disk.
  • Slide 9
  • 13.1.2 Transfer of Data Between Levels Data moves between adjacent levels of the hierarchy. At the secondary and tertiary levels, accessing the desired data or finding the desired place to store data takes a lot of time. The disk is organized into blocks. Entire blocks are moved to and from a region of main memory called a buffer. A key technique for speeding up database operations is to arrange the data so that when one piece of a data block is needed, it is likely that other data on the same block will be needed at the same time. The same idea applies to the other levels of the hierarchy.
  • Slide 10
  • 13.1.3 Volatile and Nonvolatile Storage A volatile device forgets the data stored on it when the power goes off. A nonvolatile device holds its data even when the device is turned off. Secondary and tertiary storage devices are nonvolatile; main memory is volatile.
  • Slide 11
  • 13.1.4 Virtual Memory A technique that gives an application program the impression that it has contiguous working memory (an address space), while in fact the memory may be physically fragmented and may even overflow onto disk storage. The technique makes programming of large applications easier and uses real physical memory (e.g. RAM) more efficiently. Typical software executes in virtual memory. The address space is typically 32 bits, i.e. 2^32 bytes or 4 GB. Transfer between memory and disk is in terms of blocks.
  • Slide 12
  • 13.2.1 Mechanics of Disks Use of secondary storage is one of the important characteristics of a DBMS. A disk consists of two moving pieces: 1. the disk assembly and 2. the head assembly. The disk assembly consists of one or more platters, which rotate around a central spindle.
  • Slide 13
  • 13.2.1 Mechanics of Disks The disk is organized into tracks; the tracks that are at a fixed radius from the center form one cylinder. Tracks are organized into sectors, which are segments of the circle separated by gaps.
  • Slide 14
  • Slide 15
  • 13.2.2 Disk Controller One or more disks are controlled by a disk controller. Disk controllers are capable of: controlling the mechanical actuator that moves the head assembly; selecting the sector from among all those in the cylinder at which the heads are positioned; transferring bits between the desired sector and main memory; possibly buffering an entire track.
  • Slide 16
  • 13.2.3 Disk Access Characteristics Accessing (reading/writing) a block requires three steps: 1. The disk controller positions the head assembly at the cylinder containing the track on which the block is located; this is the seek time. 2. The disk controller waits while the first sector of the block moves under the head; this is the rotational latency. 3. All the sectors and the gaps between them pass under the head while the disk controller reads or writes the data; this is the transfer time.
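Putting the three delays together gives a quick estimate of block-access time. A minimal sketch in Python; the drive parameters below (average seek, spindle speed, transfer rate) are illustrative assumptions, though they land close to the roughly 11 ms per 16 KB block quoted for the Megatron 747:

    # Estimate the time to access one block: seek + rotational latency + transfer.
    # All parameter values are assumed for illustration.
    def block_access_ms(avg_seek_ms=6.46, rpm=7200, block_kb=16, transfer_mb_s=100):
        rotation_ms = 60_000 / rpm               # time for one full rotation
        rotational_latency_ms = rotation_ms / 2  # on average, half a rotation
        transfer_ms = block_kb / 1024 / transfer_mb_s * 1000
        return avg_seek_ms + rotational_latency_ms + transfer_ms

    print(f"~{block_access_ms():.2f} ms per block")   # ~10.79 ms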
  • Slide 17
  • 13.3 Accelerating Access to Secondary Storage Several approaches for accessing data in secondary storage more efficiently: place blocks that are accessed together in the same cylinder; divide the data among multiple disks; mirror disks; use disk-scheduling algorithms; prefetch blocks into main memory. Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm. Throughput: the number of disk accesses per second that the system can accommodate.
  • Slide 18
  • 13.3.1 The I/O Model of Computation The number of block accesses (disk I/Os) is a good approximation of the time an algorithm takes: disk I/Os are proportional to the time taken. Ex 13.3: You want an index on R that identifies the block on which the desired tuple appears, but not where on the block it resides. For the Megatron 747 (M747) example, it takes about 11 ms to read a 16 KB block; the delay in searching the block for the desired tuple is negligible by comparison.
  • Slide 19
  • 13.3.2 Organizing Data by Cylinders Ex 13.4: We request 1024 blocks of the M747. If the data is randomly distributed, the average latency is 10.76 ms per block (Ex 13.2), making the total latency about 11 s. If all the blocks are stored consecutively on one cylinder, the time is 6.46 ms + 8.33 ms * 16 = 139 ms (one average seek, then 16 rotations at one rotation's worth of transfer each). Only the first seek time and the first rotational latency can never be eliminated.
  • Slide 20
  • 13.3.3 Using Multiple Disks Performance improves by a factor proportional to the number of disks. Striping: distributing a relation across n disks in the pattern: disk 1 holds blocks R1, R(1+n), R(1+2n), ...; disk 2 holds R2, R(2+n), R(2+2n), ...; disk n holds Rn, R(n+n), R(n+2n), .... Ex 13.5: We request 1024 blocks with n = 4: 6.46 ms + 8.33 ms * (16/4) = 39.8 ms (one average seek, then a quarter of the rotations on each disk in parallel).
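Both examples plug into the same formula: one average seek, then some number of rotation-times of transfer, divided across the disks. A quick check of the arithmetic, using the figures from Ex 13.4 and 13.5:

    # Cylinder-organized read: one average seek, then whole-track transfers.
    AVG_SEEK_MS = 6.46   # M747 average seek time, from the examples
    ROTATION_MS = 8.33   # time per rotation

    def read_time_ms(rotations, disks=1):
        # Each disk transfers its share of the rotations in parallel.
        return AVG_SEEK_MS + ROTATION_MS * rotations / disks

    print(read_time_ms(16))           # ~139.7 ms: one cylinder, one disk
    print(read_time_ms(16, disks=4))  # ~39.8 ms: striped across 4 disks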
  • Slide 21
  • 13.3.4 Mirroring Disks Mirroring: two or more disks hold identical copies of the data. Benefit 1: if n disks are mirrors of each other, the system can survive crashes of up to n-1 disks. Benefit 2: with n disks, read performance increases by a factor of n, since reads can be spread across the copies.
  • Slide 22
  • 13.3.5 Disk Scheduling and the Elevator Problem The disk controller runs this algorithm to select which of several pending requests to process first. Pseudocode:

        requests = []                    // all unserviced data requests
        on receiving a new data request:
            requests.add(request)
        while requests is not empty:
            move head to the next cylinder in the current direction
            if head is at a cylinder with a request in requests:
                retrieve the data
                remove the request from requests
            if head reaches the end:
                reverse the head direction
  • Slide 23
  • 13.3.5 Disk Scheduling and the Elevator Problem (cont.) Event trace (cylinders 8000 through 64000): at time 0 the head starts with requests pending at cylinders 8000, 24000, and 56000. Data at 8000 is retrieved at time 4.3; a request for 16000 arrives at time 10; data at 24000 is retrieved at 13.6; a request for 64000 arrives at 20; data at 56000 is retrieved at 26.9; a request for 40000 arrives at 30; data at 64000 is retrieved at 34.2; the head then reverses; data at 40000 is retrieved at 45.5 and data at 16000 at 56.8.
  • Slide 24
  • 13.3.5 Disk Scheduling and the Elevator Problem (cont.) Completion times, elevator algorithm: 8000 at 4.3, 24000 at 13.6, 56000 at 26.9, 64000 at 34.2, 40000 at 45.5, 16000 at 56.8. FIFO algorithm: 8000 at 4.3, 24000 at 13.6, 56000 at 26.9, 16000 at 42.2, 64000 at 59.5, 40000 at 70.8.
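The contrast between the two policies is easy to reproduce. A minimal simulator under a toy cost model (travel time proportional to distance plus a fixed per-request service time, with all requests known up front); the constants are assumptions, so it will not reproduce the exact times above, but it shows the elevator's advantage:

    # Compare elevator vs FIFO disk scheduling under an assumed cost model.
    TRAVEL_PER_CYL = 0.001   # ms per cylinder moved (assumed)
    SERVICE = 1.0            # ms to read once positioned (assumed)

    def fifo(start, requests):
        t, pos, done = 0.0, start, []
        for cyl in requests:                  # serve strictly in arrival order
            t += abs(cyl - pos) * TRAVEL_PER_CYL + SERVICE
            pos = cyl
            done.append((cyl, round(t, 1)))
        return done

    def elevator(start, requests):
        t, pos, up = 0.0, start, True
        pending, done = sorted(requests), []
        while pending:
            ahead = [c for c in pending if (c >= pos) == up]
            if not ahead:                     # nothing further this way
                up = not up                   # reverse direction
                continue
            nxt = min(ahead) if up else max(ahead)
            t += abs(nxt - pos) * TRAVEL_PER_CYL + SERVICE
            pos = nxt
            pending.remove(nxt)
            done.append((nxt, round(t, 1)))
        return done

    reqs = [8000, 24000, 56000, 16000, 64000, 40000]
    print("elevator:", elevator(0, reqs))
    print("fifo:    ", fifo(0, reqs))

The FIFO totals are larger because the head zigzags between distant cylinders instead of sweeping once across them.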
  • Slide 25
  • 13.3.6 Prefetching and Large-Scale Buffering If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed. This reduces access cost and time.
  • Slide 26
  • 13.4 Disk Failures Intermittent failure: a read or write is unsuccessful; e.g., we try to read a sector, but the correct contents of that sector are not delivered to the disk controller. Repeated tries let us distinguish good sectors from bad, and to check that a write was correct, a read is performed afterward. Checksums: each sector carries some additional bits, called checksum bits, set depending on the values of the data bits stored in that sector. With checksums, the probability of accepting a bad read is small. Parity: if the data has an odd number of 1s, the parity bit is 1; if it has an even number of 1s, the parity bit is 0. The total number of 1s, data plus parity, is therefore always even.
  • Slide 27
  • Example: 1. Sequence 01101000 has an odd number of 1s, so the parity bit is 1: 011010001. 2. Sequence 11101110 has an even number of 1s, so the parity bit is 0: 111011100. Stable-storage writing policy: used to recover from the disk failure known as media decay, in which an overwritten file is no longer read correctly. Sectors are paired, and each pair represents one value X, with left and right copies Xl and Xr. Writes check the parity of the left and right copies, substituting a spare sector for Xl or Xr until a good value is returned.
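The parity computation is a one-liner; a small sketch that reproduces the two sequences above:

    # Even-parity scheme: append a bit so the total number of 1s is even.
    def add_parity(bits: str) -> str:
        return bits + ('1' if bits.count('1') % 2 else '0')

    assert add_parity('01101000') == '011010001'  # odd count of 1s -> parity 1
    assert add_parity('11101110') == '111011100'  # even count of 1s -> parity 0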
  • Slide 28
  • The term used for these strategies is RAID, or Redundant Arrays of Independent Disks. Mirroring: the mirroring scheme is referred to as RAID level 1 protection against data loss. In this scheme we mirror each disk: one disk is called the data disk and the other the redundant disk. Data can be lost only if a second disk crashes while the first crash is being repaired. Parity blocks: the RAID level 4 scheme uses only one redundant disk no matter how many data disks there are. In the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks: the jth bits of the ith blocks of all the disks, data and redundant, must have an even number of 1s, and the redundant disk's bit is chosen to make this condition true.
  • Slide 29
  • Failures: if one of Xl and Xr fails, X can be read from the other; if both fail, X is not readable, but the probability of that is very small. Write failure (e.g., during a power outage): 1. while writing Xl, Xr remains good, and X can be read from Xr; 2. after writing Xl, X can be read from Xl, since Xr may or may not have the correct copy of X. Recovery from disk crashes: to reduce data loss from disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicate sectors, can be applied.
  • Slide 30
  • Parity Block Writing When we write a new block on a data disk, we must update the corresponding block of the redundant disk as well. One approach is to read the corresponding blocks of all the other data disks, compute the modulo-2 sum, and write it to the redundant disk; with n data disks this requires n-1 reads, one data-block write, and one redundant-block write, for a total of n+1 disk I/Os. Multiple crashes: RAID 4 is effective in preserving data unless there are two simultaneous disk crashes. The theory of error-correcting codes known as Hamming codes leads to RAID level 6, under which two simultaneous crashes are correctable. In the example with four data disks (1-4) and three redundant disks (5-7): the bits of disk 5 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 3; the bits of disk 6 are the modulo-2 sum of the corresponding bits of disks 1, 2, and 4; the bits of disk 7 are the modulo-2 sum of the corresponding bits of disks 1, 3, and 4. Coping with multiple disk crashes, reading/writing: we may read data from any data disk normally. To write a block of some data disk, we compute the modulo-2 sum of the new and old versions of that block; these bits are then added, in a modulo-2 sum, to the corresponding blocks of all those redundant disks that have 1 in the row in which the written disk also has 1.
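Modulo-2 sums are just bitwise XOR, which makes the parity rules easy to demonstrate. A minimal sketch (the block contents are made up for illustration):

    # RAID 4 parity: the redundant block is the XOR (modulo-2 sum) of the
    # corresponding blocks of all the data disks.
    from functools import reduce

    def parity(blocks):
        return reduce(lambda a, b: a ^ b, blocks)

    data = [0b10110010, 0b01101000, 0b00111000]   # ith blocks of 3 data disks
    p = parity(data)

    # Update rule: new parity = old parity XOR (old block XOR new block),
    # so a write needs only the old block and the old parity, not every disk.
    new_block = 0b11111111
    p = p ^ (data[0] ^ new_block)
    data[0] = new_block
    assert p == parity(data)    # matches recomputing the parity from scratch

This is why the naive n+1-I/O scheme above can be reduced to a fixed number of I/Os per write (read the old block and old parity, write both), regardless of n.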
  • Slide 31
  • Whatever scheme we use for updating the disks, we need to read and write the redundant disk's block on every write. If there are n data disks, the redundant disk receives n times the average number of writes to any one data disk. However, we do not have to treat one disk as the redundant disk and the others as data disks: each disk can serve as the redundant disk for some of the blocks. This improvement is called RAID level 5.
  • Slide 32
  • 13.5 Arranging Data on Disk Data elements are represented as records, which are stored in consecutive bytes within the same disk block. Basic layout technique for storing data: fixed-length records. Allocation criterion: data should start at a word boundary. A fixed-length record header contains: 1. a pointer to the record schema; 2. the length of the record; 3. timestamps indicating when the record was last modified or last read.
  • Slide 33
  • Example: CREATE TABLE employee( name CHAR(30) PRIMARY KEY, address VARCHAR(255), gender CHAR(1), birthdate DATE ); Each record starts at a word boundary and contains a header plus the four fields name, address, gender, and birthdate.
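To make the word-boundary rule concrete, here is a small sketch that computes field offsets for this table, rounding each field up to a 4-byte boundary; the 12-byte header and the 10-byte DATE size are assumptions for illustration:

    # Word-aligned layout for the fixed-length employee record.
    HEADER = 12   # assumed: schema pointer + record length + timestamp
    WORD = 4

    def align(n):
        return (n + WORD - 1) // WORD * WORD

    offset = align(HEADER)
    for field, size in [("name", 30), ("address", 255),
                        ("gender", 1), ("birthdate", 10)]:
        print(f"{field:9s} at offset {offset:3d} ({size} bytes)")
        offset = align(offset + size)
    print("record length:", offset)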
  • Slide 34
  • Packing Fixed-Length Records into Blocks Records are stored in blocks on the disk and are moved into main memory when we need to access or update them. A block header is written first, followed by the series of records. The block header contains: links to one or more blocks that are part of a network of blocks; information about the role played by this block in such a network; information about which relation the tuples in this block belong to;
  • Slide 35
  • a "directory" giving the offset of each record in the block; and timestamp(s) indicating the time of the block's last modification and/or access. After the header we pack as many records as will fit in the block, as shown in the figure; the remaining space is unused.
  • Slide 36
  • 13.6 Representing Block and Record Addresses In main memory, the address of a block is the virtual-memory address of its first byte, and the address of a record within the block is the virtual-memory address of the record's first byte. In secondary memory, a sequence of bytes describes the location of the block in the overall system: the device ID for the disk, the cylinder number, and so on.
  • Slide 37
  • Addresses in Client-Server Systems Addresses in the address space are represented in two ways. Physical addresses: byte strings that determine the place within the secondary storage system where the record can be found. Logical addresses: arbitrary byte strings of some fixed length. Physical-address bits are used to indicate: the host to which the storage is attached, the identifier for the disk, the number of the cylinder, the number of the track, and the offset of the beginning of the record.
  • Slide 38
  • The map table relates logical addresses to physical addresses: each entry pairs a logical address with the corresponding physical address (table in figure).
  • Slide 39
  • Logical and Structured Addresses What is the purpose of a logical address? It gives more flexibility: we can move the record around within the block, or move the record to another block, without invalidating references, and it gives us options for deciding what to do when a record is deleted. Pointer swizzling: pointers are common in object-relational database systems, so it is important to learn about their management. Every data item (block, record, etc.) has two addresses: its database address (the address on disk) and its memory address, if the item is currently in virtual memory.
  • Slide 40
  • Translation table: maps database addresses to memory addresses; each entry pairs a database address with a memory address (table in figure). All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table.
  • Slide 41
  • A pointer consists of two fields: a bit indicating the type of address, and the database or memory address itself. Example 13.17 (figure): Block 1 appears both on disk and in memory; a pointer into Block 1 is swizzled, while a pointer into Block 2, still on disk, remains unswizzled.
  • Slide 42
  • Example 13.7: Block 1 has a record with pointers to a second record on the same block and to a record on another block. If Block 1 is copied into memory, the first pointer, which points within Block 1, can be swizzled so it points directly to the memory address of the target record. Since Block 2 is not in memory, we cannot swizzle the second pointer. Three types of swizzling: Automatic swizzling: as soon as a block is brought into memory, swizzle all relevant pointers.
  • Slide 43
  • Swizzling on demand: only swizzle a pointer if and when it is actually followed. No swizzling: pointers are never swizzled; items are always accessed using the database address. Unswizzling: when a block is moved from memory back to disk, all pointers must be converted back to database (disk) addresses, using the translation table again. It is important to have an efficient data structure for the translation table.
  • Slide 44
  • Pinned Records and Blocks A block in memory is said to be pinned if it cannot be written back to disk safely. If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned. To unpin a block, we must unswizzle any pointers to it: keep in the translation table the places in memory holding swizzled pointers to that item, and unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
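The bookkeeping for swizzling, unswizzling, and pinning fits in a few lines. A minimal sketch; the data structures and method names are illustrative, not from the text:

    # Toy translation table: db address -> memory address, plus, per item,
    # the set of memory locations currently holding swizzled pointers to it.
    class TranslationTable:
        def __init__(self):
            self.db_to_mem = {}
            self.swizzled_into = {}   # db addr -> set of pointer sites

        def load(self, db_addr, mem_addr):
            self.db_to_mem[db_addr] = mem_addr

        def swizzle(self, ptr_site, db_addr):
            mem = self.db_to_mem.get(db_addr)
            if mem is None:
                return db_addr        # target not in memory: leave unswizzled
            self.swizzled_into.setdefault(db_addr, set()).add(ptr_site)
            return mem

        def is_pinned(self, db_addr):
            # Pinned while any swizzled pointer into the item exists.
            return bool(self.swizzled_into.get(db_addr))

        def unswizzle_all(self, db_addr):
            # Revert every recorded pointer site to the database address.
            self.swizzled_into.pop(db_addr, None)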
  • Slide 45
  • 13.7 Records With Variable-Length Fields A simple but effective scheme is to put all fixed-length fields ahead of the variable-length fields. We then place in the record header: 1. the length of the record; 2. pointers to (i.e., offsets of) the beginnings of all the variable-length fields. However, if the variable-length fields always appear in the same order, then the first of them needs no pointer; we know it immediately follows the fixed-length fields.
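A sketch of this layout, packing the fixed-length fields first and recording the record length and the variable-field offsets in the header (the header encoding is an assumption for illustration):

    # Pack a record: fixed-length fields first, then variable-length fields,
    # with the header holding the record length and each variable offset.
    import struct

    def pack_record(fixed: bytes, var_fields: list) -> bytes:
        n = len(var_fields)
        pos = 4 + 4 * n + len(fixed)     # header, then the fixed-length part
        offsets = []
        for f in var_fields:
            offsets.append(pos)
            pos += len(f)
        header = struct.pack(f"<I{n}I", pos, *offsets)   # length + offsets
        return header + fixed + b"".join(var_fields)

    rec = pack_record(b"F" * 8, [b"123 Maple St.", b"some comment"])
    assert struct.unpack_from("<I", rec)[0] == len(rec)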
  • Slide 46
  • Records With Repeating Fields A similar situation occurs if a record contains a variable number of occurrences of a field F, but the field itself is of fixed length. It is sufficient to group all occurrences of F together and put in the record header a pointer to the first. We can locate all the occurrences of F as follows: let L be the number of bytes devoted to one instance of F; to the offset of the first occurrence we add integer multiples of L, starting at 0, then L, 2L, 3L, and so on, until we reach the offset of the field following F, whereupon we stop.
  • Slide 47
  • An alternative representation is to keep the record itself of fixed length and put the variable-length portion, be it fields of variable length or fields that repeat an indefinite number of times, on a separate block. In the record itself we keep: 1. pointers to the place where each repeating field begins, and 2. either how many repetitions there are, or where the repetitions end.
  • Slide 48
  • Variable-Format Records The simplest representation of variable-format records is a sequence of tagged fields, each of which consists of: 1. Information about the role of this field, such as: (a) The attribute or field name, (b) The type of the field, if it is not apparent from the field name and some readily available schema information, and (c) The length of the field, if it is not apparent from the type. 2. The value of the field. There are at least two reasons why tagged fields would make sense.
  • Slide 49
  • 1. Information-integration applications: sometimes a relation has been constructed from several earlier sources, and these sources record different kinds of information. For instance, our movie-star information may have come from several sources, one of which records birthdates while some give addresses and others do not, and so on. If there are not too many fields, we are probably best off leaving NULL the values we do not know. 2. Records with a very flexible schema: if many fields of a record can repeat and/or not appear at all, then even if we know the schema, tagged fields may be useful. For instance, medical records may contain information about many tests, but there are thousands of possible tests, and each patient has results for relatively few of them.
  • Slide 177
  • BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) ≤ M. Put another way, s(M, 1) = M. INDUCTION: Suppose k > 1. Then we partition R into M pieces, each of which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/M, which is the size of each of the M pieces of R, cannot exceed s(M, k - 1). That is: s(M, k) = M * s(M, k - 1). Performance of Multipass, Sort-Based Algorithms
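Unwinding the recurrence gives a closed form (standard algebra, not spelled out on the slide):

    s(M, 1) = M, \qquad s(M, k) = M \cdot s(M, k-1) \;\Longrightarrow\; s(M, k) = M^{k}

so sorting a relation of B(R) blocks with M buffers needs k = \lceil \log_M B(R) \rceil passes.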
  • Slide 178
  • Multipass Hash-Based Algorithms BASIS: For a unary operation, if the relation fits in M buffers, read it into memory and perform the operation. For a binary operation, if either relation fits in M - 1 buffers, perform the operation by reading that relation into main memory and then reading the second relation, one block at a time, into the Mth buffer. INDUCTION: If no relation fits in main memory, then hash each relation into M - 1 buckets, as discussed in Section 15.5.1. Recursively perform the operation on each bucket or corresponding pair of buckets, and accumulate the output from each bucket or pair.
  • Slide 179
  • The Query Compiler 16.1 Parsing and Preprocessing Meghna Jain(205) Dr. T. Y. Lin
  • Slide 180
  • Presentation Outline 16.1 Parsing and Preprocessing 16.1.1 Syntax Analysis and Parse Tree 16.1.2 A Grammar for Simple Subset of SQL 16.1.3 The Preprocessor 16.1.4 Processing Queries Involving Views
  • Slide 181
  • Query compilation is divided into three steps. 1. Parsing: parse the SQL query into a parse tree. 2. Logical query plan: transform the parse tree into an expression tree of relational algebra. 3. Physical query plan: transform the logical query plan into a physical query plan, which specifies the operations performed, the order of the operations, the algorithm used for each, and the way stored data is obtained and passed from one operation to another.
  • Slide 182
  • From a query to a logical query plan (figure): Query -> Parser -> Preprocessor -> Logical query plan generator -> Query rewrite -> Preferred logical query plan.
  • Slide 183
  • Syntax Analysis and Parse Tree The parser takes the SQL query and converts it to a parse tree. Nodes of the parse tree: 1. Atoms: lexical elements such as keywords, constants, parentheses, operators such as + and <, and other schema elements. 2. Syntactic categories: names for families of query subparts that play a similar role in a query, written in angle brackets, such as <Query> or <Condition>.
  • Slide 184
  • A Grammar for a Simple Subset of SQL The syntactic category <Query> is intended to represent all well-formed queries of SQL. Some of its rules:

        <Query> ::= <SFW>
        <Query> ::= ( <Query> )

    Select-From-Where forms give the syntactic category <SFW>:

        <SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>

    Select-lists:

        <SelList> ::= <Attribute>, <SelList>
        <SelList> ::= <Attribute>

    From-lists:

        <FromList> ::= <Relation>, <FromList>
        <FromList> ::= <Relation>
  • Slide 185
  • Conditions:

        <Condition> ::= <Condition> AND <Condition>
        <Condition> ::= <Tuple> IN <Query>
        <Condition> ::= <Attribute> = <Attribute>
        <Condition> ::= <Attribute> LIKE <Pattern>

    Atoms are constants and variables; syntactic categories appear in angle brackets; '::=' reads 'can be expressed/defined as'.
  • Slide 186
  • Query and Parse Tree StarsIn(title,year,starName) MovieStar(name,address,gender,birthdate) Query: Give titles of movies that have at least one star born in 1960 SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%' );
  • Slide 187
  • Slide 188
  • Another query equivalent SELECT title FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%' ;
  • Slide 189
  • Parse tree (figure) for the equivalent query: the <SFW> node has select-list title; from-list StarsIn, MovieStar; and condition starName = name AND birthdate LIKE '%1960%'.
  • Slide 190
  • The Preprocessor Functions of the preprocessor: if a relation used in the query is a virtual view, then each use of that relation in the from-list must be replaced by a parse tree that describes the view. The preprocessor is also responsible for semantic checking. 1. Check relation uses: every relation mentioned in a FROM clause must be a relation or view in the current schema. 2. Check and resolve attribute uses: every attribute mentioned in a SELECT or WHERE clause must be an attribute of some relation in the current scope; for instance, attribute title in the first select-list. 3. Check types: all attributes must be of a type appropriate to their uses. Since birthdate is a date, and dates in SQL can normally be treated as strings, this use of the attribute is validated. Likewise, operators are checked to see that they apply to values of appropriate and compatible types.
  • Slide 191
  • StarsIn(title,year,starName) MovieStar(name,address,gender,birthdate) Query: Give titles of movies that have at least one star born in 1960 SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%' );
  • Slide 192
  • Preprocessing Queries Involving Views When an operand in a query is a virtual view, the preprocessor must replace the operand by a piece of parse tree that represents how the view is constructed from base tables. Base table: Movies(title, year, length, genre, studioName, producerC#) View definition: CREATE VIEW ParamountMovies AS SELECT title, year FROM Movies WHERE studioName = 'Paramount'; Example query based on the view: SELECT title FROM ParamountMovies WHERE year = 1979;
  • Slide 193
  • Thank You
  • Slide 194
  • Query Compiler By: Payal Gupta Roll No: 106(225) Professor: Tsau Young Lin
  • Slide 195
  • Pushing Selections Pushing a selection means replacing the left side of one of the rules by its right side. In pushing selections, we first push a selection as far up the tree as it will go, and then push copies of it down all possible branches.
  • Slide 196
  • Let's take an example: StarsIn(title, year, starName) Movie(title, year, length, inColor, studioName, producerC#) Define the view MoviesOf1996 by: CREATE VIEW MoviesOf1996 AS SELECT * FROM Movie WHERE year = 1996;
  • Slide 197
  • "Which stars worked for which studios in 1996?" can be expressed by the SQL query: SELECT starName, studioName FROM MoviesOf1996 NATURAL JOIN StarsIn;
  • Slide 198
  • Figure: logical query plan constructed from the definitions of the query and the view: π_{starName, studioName} over the join of σ_{year = 1996}(Movie) with StarsIn.
  • Slide 199
  • Figure: improving the query plan by moving selections up and down the tree: σ_{year = 1996} is pushed into both branches, giving π_{starName, studioName} over the join of σ_{year = 1996}(StarsIn) with σ_{year = 1996}(Movie).
  • Slide 200
  • Laws Involving Projection "Pushing" projections really involves introducing a new projection somewhere below an existing projection. A projection keeps the number of tuples the same and only reduces the length of tuples. To describe the transformations of extended projection, consider a term E → x on the list for a projection, where E is an attribute or an expression involving attributes and constants, and x is an output attribute.
  • Slide 201
  • Example: Let R(a, b, c) and S(c, d, e) be two relations. Consider the expression π_{a+e→x, b→y}(R ⋈ S). The input attributes of the projection are a, b, and e, and c is the only join attribute. We may apply the law for pushing projections below joins to get the equivalent expression: π_{a+e→x, b→y}(π_{a,b,c}(R) ⋈ π_{c,e}(S)).
  • Slide 202
  • Eliminating the projection on R gives a third equivalent expression: π_{a+e→x, b→y}(R ⋈ π_{c,e}(S)). In addition, we can perform a projection entirely before a bag union. That is: π_L(R ∪_B S) = π_L(R) ∪_B π_L(S).
  • Slide 203
  • Laws About Joins and Products Laws that follow directly from the definition of the join: R ⋈_C S = σ_C(R × S) and R ⋈ S = π_L(σ_C(R × S)), where C is the condition that equates each pair of attributes from R and S with the same name, and L is a list that includes one attribute from each equated pair and all the other attributes of R and S. In practice, we identify a product followed by a selection as a join of some kind.
  • Slide 204
  • Laws Involving Duplicate Elimination The operator δ, which eliminates duplicates from a bag, can be pushed through many, but not all, operators. In general, moving a δ down the tree reduces the size of intermediate relations and may therefore be beneficial. Moreover, sometimes we can move δ to a position where it can be eliminated altogether, because it is applied to a relation that is known not to possess duplicates.
  • Slide 205
  • δ(R) = R if R has no duplicates. Important cases of such a relation R include: a) a stored relation with a declared primary key, and b) a relation that is the result of a γ (grouping) operation, since grouping creates a relation with no duplicates.
  • Slide 206
  • Several laws that "push" δ through other operators:

        δ(R × S) = δ(R) × δ(S)
        δ(R ⋈ S) = δ(R) ⋈ δ(S)
        δ(R ⋈_C S) = δ(R) ⋈_C δ(S)
        δ(σ_C(R)) = σ_C(δ(R))

    We can also move the δ to either or both arguments of an intersection: δ(R ∩_B S) = δ(R) ∩_B S = R ∩_B δ(S) = δ(R) ∩_B δ(S).
  • Slide 207
  • Laws Involving Grouping and Aggregation When we consider the operator γ, we find that the applicability of many transformations depends on the details of the aggregate operators used; thus we cannot state laws in the generality we used for the other operators. One exception is that a γ absorbs a δ. Precisely: γ_L(δ(R)) = γ_L(R).
  • Slide 208
  • Let us call an operator γ_L duplicate-impervious if the only aggregations in L are MIN and/or MAX. Then: γ_L(R) = γ_L(δ(R)), provided γ_L is duplicate-impervious.
  • Slide 209
  • Example: Suppose we have the relations MovieStar(name, addr, gender, birthdate) StarsIn(movieTitle, movieYear, starName) and we want to know, for each year, the birthdate of the youngest star to appear in a movie that year. We can express this query as: SELECT movieYear, MAX(birthdate) FROM MovieStar, StarsIn WHERE name = starName GROUP BY movieYear;
  • Slide 210
  • Figure: initial logical query plan for the query: γ_{movieYear, MAX(birthdate)} over σ_{name = starName} over the product MovieStar × StarsIn.
  • Slide 211
  • Some transformations that we can apply to the plan: 1. Combine the selection and product into an equijoin. 2. Generate a δ below the γ, since the γ is duplicate-impervious. 3. Generate a π between the γ and the δ to project onto movieYear and birthdate, the only attributes relevant to the γ.
  • Slide 212
  • Figure: another query plan for the query: γ_{movieYear, MAX(birthdate)} over π_{movieYear, birthdate} over the equijoin MovieStar ⋈_{name = starName} StarsIn.
  • Slide 213
  • Figure: a third query plan for the example, with projections pushed below the join: γ_{movieYear, MAX(birthdate)} over π_{movieYear, birthdate} over π_{birthdate, name}(MovieStar) ⋈_{name = starName} π_{movieYear, starName}(StarsIn).
  • Slide 214
  • The Query Compiler Section 16.3 Database Systems: The Complete Book Presented by: Deepti Kundu Under the supervision of: Dr. T.Y. Lin
  • Slide 215
  • Review (figure): Query -> Parser -> Preprocessor (Section 16.1) -> Logical query plan generator -> Query rewriter (Section 16.3) -> Preferred logical query plan.
  • Slide 216
  • Two steps to turn a parse tree into a preferred logical query plan: 1. Replace the nodes and structures of the parse tree, in appropriate groups, by an operator or operators of relational algebra. 2. Take the relational-algebra expression and turn it into an expression that we expect can be converted to the most efficient physical query plan.
  • Slide 217
  • Reference relations: StarsIn(movieTitle, movieYear, starName) MovieStar(name, address, gender, birthdate) Conversion to relational algebra: if we have a <Query> with a <Condition> that has no subqueries, then we may replace the entire construct (the select-list, from-list, and condition) by a relational-algebra expression.
  • Slide 218
  • The relational-algebra expression consists of the following, from bottom to top: the product of all the relations mentioned in the <FromList>, which is the argument of a selection σ_C, where C is the <Condition> in the construct being replaced, which in turn is the argument of a projection π_L, where L is the list of attributes in the <SelList>. Example: SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%';
  • Slide 219
  • SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%';
  • Slide 220
  • Translation to an algebraic expression tree
  • Slide 221
  • Removing Subqueries From Conditions For parse trees with a <Condition> that has a subquery, we use an intermediate operator: the two-argument selection. It is intermediate between the syntactic categories of the parse tree and the relational-algebra operators that apply to relations.
  • Slide 222
  • Figure: an expression using a two-argument selection: π_movieTitle over a selection node whose first argument is StarsIn and whose second argument is the condition starName IN (π_name(σ_{birthdate LIKE '%1960%'}(MovieStar))).
  • Slide 223
  • Two-argument selection with a condition involving IN: the selection has two arguments: some relation R, and a <Condition> of the form t IN S, where t is a tuple composed of some attributes of R and S is an uncorrelated subquery. Steps to be followed: 1. Replace the <Condition> by the tree that is the expression for S, with a δ at the root (δ is used to remove duplicates). 2. Replace the two-argument selection by a one-argument selection σ_C, where C equates the components of t to the corresponding attributes of S. 3. Give σ_C an argument that is the product of R and δ(S).
  • Slide 224
  • Figure: the transformation: the two-argument selection over R with condition t IN S becomes σ_C applied to the product R × δ(S).
  • Slide 225
  • The effect
  • Slide 226
  • Improving the Logical Query Plan Algebraic laws to improve logical query plans: Selections can be pushed down the expression tree as far as they can go. Similarly, projections can be pushed down the tree, or new projections can be added. Duplicate eliminations can sometimes be removed, or moved to a more convenient position in the tree. Certain selections can be combined with a product below to turn the pair of operations into an equijoin.
  • Slide 227
  • Grouping Associative/Commutative Operators An operator that is associative and commutative may be thought of as having any number of operands. We need to reorder these operands so that a multiway join is executed as the best sequence of binary joins; executing them in the order suggested by the parse tree may be more time-consuming. For each portion of the subtree that consists of nodes with the same associative and commutative operator (natural join, union, or intersection), we group those nodes into a single node with many children.
  • Slide 228
  • The effect of query rewriting (figure): π_movieTitle over σ_{starName = name} over the product of StarsIn and σ_{birthdate LIKE '%1960%'}(MovieStar).
  • Slide 229
  • Final step in producing the logical query plan (figure): cascades of binary unions over R, S, T and U, V, W are grouped into single multiway union nodes with children R, S, T, U, V, W.
  • Slide 230
  • An example to summarize: find movies where the average age of the stars was at most 40 when the movie was made. SELECT DISTINCT m1.movieTitle, m1.movieYear FROM StarsIn m1 WHERE m1.movieYear - 40 <= (SELECT AVG(birthdate) ...); (the subquery averages the birth years of the stars appearing in the movie)
  • Notation for Physical Query Plans (cont.) Example of a physical query plan: the plan in Example 16.36 for the case k > 5000 uses a TableScan, a two-pass hash join, and materialization (shown by a double line) with a Store operator.
  • Slide 309
  • Notation for Physical Query Plans (cont.) Another example: the plan in Example 16.36 for the case k < 49 uses two TableScans, a two-pass hash join, and pipelining, with different buffer needs, plus a Store operator.
  • Slide 310
  • Notation for Physical Query Plans (cont.) The physical query plan in Example 16.35: use the index on the condition y = 2 first, then filter with the rest of the condition afterward.
  • Slide 311
  • VII. Ordering of Physical Operations The physical query plan (PQP) is represented as a tree structure, which implies an order of operations. Still, the order of evaluation of interior nodes may not always be clear: iterators are used in a pipelined manner, so the execution of various nodes overlaps in time and a strict ordering of them makes no sense.
  • Slide 312
  • Ordering of Physical Operations (cont.) Three rules summarize the ordering of events in a PQP tree: 1. Break the tree into subtrees at each edge that represents materialization; execute one subtree at a time. 2. Order the execution of the subtrees bottom-up and left-to-right. 3. All nodes of each subtree are executed simultaneously.
  • Slide 313
  • Reference [1] H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: The Complete Book, 2nd ed., pp. 897-913, Prentice Hall, New Jersey, 2008.
  • Slide 314
  • Chapter 18 18.1 Serial and Serializable Schedules Ensuring that transactions preserve consistency when executing simultaneously is called concurrency control; this consistency is taken care of by the scheduler. Concurrency control in database management systems (DBMS) ensures that database transactions can be performed concurrently without violating the data integrity of the database. Executed transactions should follow the ACID rules, and the DBMS must guarantee that only serializable (unless serializability is intentionally relaxed) and recoverable schedules are generated.
  • Slide 315
  • It also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled-back) transactions remains in the database. ACID rules: Atomicity: either the effects of all of a transaction's operations remain or none of them do when the transaction completes; to the outside world, the transaction appears indivisible, atomic. Consistency: every transaction must leave the database in a consistent state. Isolation: transactions cannot interfere with each other; providing isolation is the main goal of concurrency control. Durability: the effects of successful transactions must persist through crashes.
  • Slide 316
  • In the field of databases, a schedule is a list of actions (reading, writing, aborting, committing) from a set of transactions. In this example, schedule D covers three transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS: T1 reads and writes object X, then T2 reads and writes object Y, and finally T3 reads and writes object Z. This is an example of a serial schedule, because the actions of the three transactions are not interleaved.
  • Slide 317
  • Serial and Serializable Schedules: a schedule that is equivalent to a serial schedule has the serializability property. In schedule E, the order in which the actions of the transactions execute is not the same as in D, but in the end E gives the same result as D.
  • Slide 318
  • Serial Schedule: T1 precedes T2 (initially A = B = 50):

        T1: READ(A,t); t := t+100; WRITE(A,t)    A = 150
        T1: READ(B,t); t := t+100; WRITE(B,t)    B = 150
        T2: READ(A,s); s := s*2;   WRITE(A,s)    A = 300
        T2: READ(B,s); s := s*2;   WRITE(B,s)    B = 300
  • Slide 319
  • Non-Serializable Schedule (initially A = B = 50):

        T1: READ(A,t); t := t+100; WRITE(A,t)    A = 150
        T2: READ(A,s); s := s*2;   WRITE(A,s)    A = 300
        T2: READ(B,s); s := s*2;   WRITE(B,s)    B = 100
        T1: READ(B,t); t := t+100; WRITE(B,t)    B = 200

    The final state A = 300, B = 200 matches neither serial order (T1 then T2 gives A = B = 300; T2 then T1 gives A = B = 200), so the schedule is not serializable.
  • Slide 320
  • A Serializable Schedule with details (initially A = B = 50; here T2 multiplies by 1):

        T1: READ(A,t); t := t+100; WRITE(A,t)    A = 150
        T2: READ(A,s); s := s*1;   WRITE(A,s)    A = 150
        T2: READ(B,s); s := s*1;   WRITE(B,s)    B = 50
        T1: READ(B,t); t := t+100; WRITE(B,t)    B = 150

    Because T2 multiplies by 1, the final state (A = B = 150) is the same as for the serial schedule T1 followed by T2, so this particular schedule is serializable even though it is not serial.
  • Slide 321
  • 18.2 Conflict Serializability Non-conflicting actions: two actions are non-conflicting if, whenever they occur consecutively in a schedule, swapping them does not affect the final state produced by the schedule; otherwise, they are conflicting. General rules for conflicting actions: two actions of the same transaction conflict, e.g. r1(A) and w1(B); two actions on the same database element conflict if at least one of them is a write, e.g. r1(A) and w2(A), or w1(A) and w2(A).
  • Slide 322
  • Conflict Serializable: we may take any schedule and make as many non-conflicting swaps as we wish, with the goal of turning the schedule into a serial schedule. If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the non-conflicting swaps. A schedule is said to be conflict-serializable when it is conflict-equivalent to one or more serial schedules. Equivalently, a schedule is conflict-serializable if and only if there exists an acyclic precedence (serializability) graph for the schedule.
  • Slide 323
  • Conflict equivalence / conflict serializability: let Ai and Aj be consecutive non-conflicting actions that belong to different transactions; we can swap Ai and Aj without changing the result. Two schedules are conflict-equivalent if one can be turned into the other by a sequence of non-conflicting swaps of adjacent actions. We call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule. Test for conflict serializability: construct the precedence graph for S and observe whether there are any cycles; if yes, S is not conflict-serializable; otherwise, it is.
  • Slide 324
  • Example of a cyclic precedence graph: consider the schedule S1: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B). Observing the conflicting actions on B: r1(B) precedes w2(B), so T1 must precede T2; but r2(B) precedes w1(B), so T2 must also precede T1. The precedence graph therefore has a cycle between T1 and T2, and S1 is not conflict-serializable.
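The test is mechanical; a minimal sketch that builds the precedence graph for S1 and checks it for a cycle (the action encoding is ad hoc):

    # Build the precedence graph of a schedule and test it for cycles.
    # Each action is (transaction, 'r' or 'w', element).
    schedule = [(2,'r','A'), (1,'r','B'), (2,'w','A'), (2,'r','B'),
                (3,'r','A'), (1,'w','B'), (3,'w','A'), (2,'w','B')]

    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i+1:]:
            if ti != tj and xi == xj and 'w' in (ai, aj):
                edges.add((ti, tj))          # ti must precede tj

    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def reachable(src, dst, seen=frozenset()):
        return any(n == dst or (n not in seen and
                                reachable(n, dst, seen | {n}))
                   for n in graph.get(src, ()))

    print(sorted(edges))                            # [(1, 2), (2, 1), (2, 3)]
    print(any(reachable(v, u) for u, v in edges))   # True: not serializable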
  • Expected exceptions: 1. Suppose there is a transaction U such that: U is in VAL or FIN, i.e., U has validated; FIN(U) > START(T), i.e., U did not finish before T started; and RS(T) ∩ WS(U) ≠ ∅; let the intersection contain database element X. 2. Suppose there is a transaction U such that: U is in VAL, i.e., U has successfully validated; FIN(U) > VAL(T), i.e., U did not finish before T entered its validation phase; and WS(T) ∩ WS(U) ≠ ∅; let X be in both write sets.
  • Slide 367
  • Validation rules: Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., FIN(U) > START(T). Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., FIN(U) > VAL(T).
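A minimal sketch of the two checks; the transaction records (read/write sets and timestamps) are illustrative:

    # Validation-based scheduling: check T against previously validated
    # transactions. RS/WS are Python sets; timestamps are numbers.
    def validate(T, validated):
        for U in validated:
            if U["FIN"] > T["START"] and T["RS"] & U["WS"]:
                return False   # rule 1: T may have missed a write by U
            if U["FIN"] > T["VAL"] and T["WS"] & U["WS"]:
                return False   # rule 2: T could write over U's newer value
        return True

    U = {"RS": {"B"}, "WS": {"D"}, "START": 1, "VAL": 2, "FIN": 4}
    T = {"RS": {"A", "B"}, "WS": {"A", "C"}, "START": 2, "VAL": 3, "FIN": 5}
    print(validate(T, [U]))    # True: both intersections with U are empty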
  • Slide 368
  • Solution: Validation of U: nothing to check. Validation of T: WS(U) ∩ RS(T) = {D} ∩ {A,B} = ∅; WS(U) ∩ WS(T) = {D} ∩ {A,C} = ∅. Validation of V: RS(V) ∩ WS(T) = {B} ∩ {A,C} = ∅; WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = ∅; RS(V) ∩ WS(U) = {B} ∩ {D} = ∅. Validation of W: RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}; RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}; WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = ∅; W is not validated.
  • Template for Query Patterns To design a wrapper, build templates for all possible queries that the mediator can ask. Mediator schema: AutosMed(serialNo, model, color, autoTrans, dealer) Source schema: Cars(serialNo, model, color, autoTrans, navi, ...) Mediator-to-wrapper template for cars of a given color ($c): SELECT * FROM AutosMed WHERE color = '$c'; => SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE color = '$c'; This wrapper template describes queries for cars of a given color. Templates needed: 2^n for n attributes, to cover all possible queries from the mediator.
  • Slide 408
  • Wrapper Generators The software that creates the wrapper is the wrapper generator (figure): templates go into the wrapper generator, which produces a table consulted by a driver; queries from the mediator pass through the driver to the source, and results flow back through the wrapper.
  • Slide 409
  • Wrapper Generators The wrapper generator creates a table that holds the various query patterns contained in the templates, together with the source query associated with each. The driver: accepts a query from the mediator; searches the table for a template that matches the query; sends the corresponding query to the source; and returns the response to the mediator.
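A minimal sketch of such a driver: the template table maps a query pattern with $-parameters to the corresponding source query (the table contents are illustrative):

    # Toy wrapper driver: match a mediator query against stored templates
    # and emit the corresponding source query.
    import re

    templates = {
        "SELECT * FROM AutosMed WHERE color = '$c'":
            "SELECT serialNo, model, color, autoTrans, 'dealer1' "
            "FROM Cars WHERE color = '$c'",
    }

    def drive(query):
        for pattern, source_form in templates.items():
            regex = re.escape(pattern).replace(r"\$c", r"(?P<c>[^']*)")
            m = re.fullmatch(regex, query)
            if m:
                return source_form.replace("$c", m.group("c"))
        raise ValueError("no matching template")

    print(drive("SELECT * FROM AutosMed WHERE color = 'blue'"))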
  • Slide 410
  • Filters Consider the car dealers' database. The wrapper template to get cars of a given model and color is: SELECT * FROM AutosMed WHERE model = '$m' AND color = '$c'; => SELECT serialNo, model, color, autoTrans, 'dealer1' FROM Cars WHERE model = '$m' AND color = '$c'; Another approach is to have a wrapper filter: the wrapper has a template that returns a superset of what the query wants, and the returned tuples are filtered so that only the desired ones are passed on. The filter component can be positioned at the wrapper or at the mediator.
  • Slide 411
  • Filters To find blue cars of model Ford: use the template to extract the blue cars; return the tuples to the mediator; and filter for the Ford model at the mediator. Store the tuples in a temporary relation TempAutos(serialNo, model, color, autoTrans, dealer) and filter by executing a local query: SELECT * FROM TempAutos WHERE model = 'Ford';
  • Slide 412
  • Other Operations at the Wrapper It is possible to take joins at the wrapper and transmit the result to the mediator. Suppose the mediator is asked to find dealers and models such that the dealer has two red cars of the same model, one with and one without automatic transmission: SELECT A1.model, A1.dealer FROM AutosMed A1, AutosMed A2 WHERE A1.model = A2.model AND A1.color = 'red' AND A2.color = 'red' AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes'; The wrapper can first obtain all the red cars: SELECT * FROM AutosMed WHERE color = 'red'; storing them in RedAutos(serialNo, model, color, autoTrans, dealer).
  • Slide 413
  • Other Operations at the Wrapper The wrapper then performs the join and the necessary selection: SELECT DISTINCT A1.model, A1.dealer FROM RedAutos A1, RedAutos A2 WHERE A1.model = A2.model AND A1.autoTrans = 'no' AND A2.autoTrans = 'yes';
  • Slide 414
  • Thank You
  • Slide 415
  • Sections 21.4–21.5 Sanuja Dabade & Eilbroun Benjamin CS 257 Dr. TY Lin INFORMATION INTEGRATION
  • Slide 416
  • 21.4 Capability-Based Optimization Introduction A typical DBMS estimates the cost of each query plan and picks what it believes to be the best. A mediator, by contrast, has little knowledge of how long its sources will take to answer, so optimization of mediator queries cannot rely on a cost measure alone to select a query plan; optimization by the mediator instead follows capability-based optimization.
  • Slide 417
  • 21.4.1 The Problem of Limited Source Capabilities Many sources have only Web-based interfaces, and Web sources usually allow querying only through a query form. E.g., the Amazon.com interface allows us to query about books in many different ways, but general queries such as SELECT * FROM books cannot be asked.
  • Slide 418
  • 21.4.1 The Problem of Limited Source Capabilities (cont.) Reasons why a source may limit the ways in which queries can be asked: the earliest databases did not use relational DBMSs that support SQL queries; indexes on a large database may make certain queries feasible while others are too expensive to execute; security reasons, e.g., a CS department may answer queries about average salary but won't disclose the details of a particular professor's salary.
  • Slide 419
  • 21.4.2 A Notation for Describing Source Capabilities For relational data, the legal forms of queries are described by adornments: sequences of codes that represent the requirements on the attributes of the relation, in their standard order. f (free): the attribute may be specified or not. b (bound): a value must be specified for the attribute, but any value is allowed. u (unspecified): no value may be specified for the attribute.
  • Slide 420
  • 21.4.2 A Notation for Describing Source Capabilities (cont.) c[S] (choice from set S): a value must be specified, and it must come from the finite set S. o[S] (optional from set S): either no value is specified, or a value from the finite set S is specified. A prime (e.g., f′) indicates that the attribute is not part of the output of the query. A capabilities specification is a set of adornments as defined above; a query must match one of the adornments in its capabilities specification.
  • Slide 421
  • 21.4.2 A Notation for Describing Source Capabilities (cont.) E.g., Dealer 1 is a source of data in the form Cars(serialNo, model, color, autoTrans, navi). A query form that requires the serial number and permits nothing else has the adornment buuuu.
  • Slide 422
  • 21.4.3 Capability-Based Query-Plan Selection Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query. The process repeats until either: enough queries have been asked at the sources to resolve all the conditions of the mediator query, so the query is answered (such a plan is called feasible); or no more valid forms of source queries can be constructed, yet the mediator query still cannot be answered.
  • Slide 423
  • 21.4.3 Capability-Based Query-Plan Selection (cont.) The simplest form of mediator query where we need to apply this strategy is a join of relations. E.g., we have sources for dealer 2: Autos(serial, model, color) and Options(serial, option). Suppose that ubf is the sole adornment for Autos, and Options has two adornments, bu and uc[autoTrans, navi]. The query: find the serial numbers and colors of McLaren models with a navigation system.
  • Slide 424
  • 21.4.4 Adding Cost-Based Optimization The mediator's query optimizer is not done when the capabilities of the sources have been examined: it must still choose among the available feasible plans. Making an intelligent, cost-based choice requires that the mediator know a great deal about the costs of the queries involved; since the sources are independent of the mediator, such costs are difficult to estimate.
  • Slide 425
  • 21.5 Optimizing Mediator Queries The chain algorithm is a greedy algorithm that answers the query by sending a sequence of requests to its sources. It always finds a solution if one exists, although the solution may not be optimal.
  • Slide 426
  • 21.5.1 Simplified Adornment Notation A query at the mediator is limited to b (bound) and f (free) adornments. We use the convention name^adornment(attributes) for describing adornments, where name is the name of the relation and the number of adornment codes equals the number of attributes.
  • Slide 427
  • 21.5.2 Obtaining Answers for Subgoals Rules for matching subgoals and sources: suppose we have a subgoal R^{x1 x2 ... xn}(a1, a2, ..., an), and the source adornment for R is y1 y2 ... yn. If y_i is b or c[S], then x_i must be b. If x_i is f, then y_i must not be output-restricted (primed). The adornment on the subgoal matches the adornment at the source if, for every other position, y_i is f, u, or o[S] and x_i is either b or f.
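These rules reduce to a per-position check. A minimal sketch, writing the subgoal adornment as a string of 'b'/'f' codes and the source adornment as a list of codes, with a trailing apostrophe marking an output-restricted (primed) position:

    # Does a subgoal adornment match a source adornment?
    def matches(sub, src):
        for x, y in zip(sub, src):
            if y[0] in "bc" and x != "b":
                return False    # source demands a binding the subgoal lacks
            if x == "f" and y.endswith("'"):
                return False    # a free attribute must appear in the output
        return True

    print(matches("bf", ["b", "u"]))    # True
    print(matches("ff", ["b", "u"]))    # False: first attribute must be bound
    print(matches("bf", ["b", "f'"]))   # False: second attribute not output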
  • Slide 428
  • 21.5.3 The Chain Algorithm Maintains two kinds of information: an adornment for each subgoal, and a relation X that is the join of the relations for all the subgoals that have been resolved. Initially, the adornment for a subgoal has b wherever the mediator query provides a constant binding for the corresponding argument, and X is a relation over no attributes containing just the empty tuple.
  • Slide 429
  • 21.5.3 The Chain Algorithm (cont.) First, initialize the adornments of the subgoals and X. Then repeatedly select a subgoal R(a1, a2, ..., an) that can be resolved: wherever its adornment has a b, the corresponding argument of R is either a constant or a variable in the schema of X. Project X onto those of its variables that appear in R.
  • Slide 430
  • 21.5.3 The Chain Algorithm (cont.) 2. For each tuple t in the projection of X, issue a query to the source, as follows (where β is a source adornment). If a component of β is b, then the corresponding component of the subgoal adornment is b, and we use the corresponding component of t in the source query. If a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of the subgoal adornment is b, and again we use the corresponding component of t in the source query. If a component of β is f, and the corresponding component of the subgoal adornment is b, we provide the constant value in the source query.
  • Slide 431
  • 21.5.3 The Chain Algorithm (cont.) If a component of β is u, then we provide no binding for this component in the source query. If a component of β is o[S], and the corresponding component of the subgoal adornment is f, then we treat it as if it were f. If a component of β is o[S], and the corresponding component of the subgoal adornment is b, then we treat it as if it were c[S]. 3. Every variable among a1, a2, ..., an is now bound.
  • Slide 432
  • 21.5.3 The Chain Algorithm (cont.) 4. Replace X with X ⋈ π_S(R), where S is the set of all the variables among a1, a2, ..., an. 5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal. If every subgoal is resolved, then X is the answer.
  • Slide 433
  • 21.5.3 The Chain Algorithm Example Mediator query: Q: Answer(c) ← R^bf(1, a) AND S^ff(a, b) AND T^ff(b, c). Source data and adornments:

        R (adornment bf):         (w, x) = (1,2), (1,3), (1,4)
        S (adornment c[2,3,5]f):  (x, y) = (2,4), (3,5)
        T (adornment bu):         (y, z) = (4,6), (5,7), (5,8)
  • Slide 434
  • 21.5.3 The Chain Algorithm Example (cont.) Initially, the adornments on the subgoals are the same as in Q, and X contains just the empty tuple. S and T cannot be resolved: their subgoal adornments are ff, while their sources require the first attribute to be bound (b or c[S]). R(1, a) can be resolved, because its adornment is matched by the source's adornment; we send R(w, x) with w = 1 and obtain the table on the previous slide.
  • Slide 435
  • 21.5.3 The Chain Algorithm Example (cont.) Project the subgoal's relation onto its second component, since only the second component of R(1, a) is a variable. This is joined with X, resulting in X equaling the relation with a ∈ {2, 3, 4}. Change the adornment on S from ff to bf.
  • Slide 436
  • 21.5.3 The Chain Algorithm Example (cont.) Now we resolve S^bf(a, b): project X onto a, then search S for tuples whose first attribute equals a value of a in X, obtaining (2, 4) and (3, 5). Join this relation with X, and remove the column for a, since a appears neither in the head nor in any unresolved subgoal, leaving b ∈ {4, 5}.
  • Slide 437
  • 21.5.3 The Chain Algorithm Example (cont.) Now we resolve T^bf(b, c): for b ∈ {4, 5} the source returns (4, 6), (5, 7), (5, 8). Join this relation with X and project onto the c attribute to get the relation for the head. The solution is {(6), (7), (8)}.
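The whole example replays in a few lines; a sketch using the tables above (ad hoc set comprehensions rather than a general chain-algorithm implementation):

    # Replay the example: Answer(c) <- R(1,a) AND S(a,b) AND T(b,c).
    R = {(1, 2), (1, 3), (1, 4)}     # (w, x)
    S = {(2, 4), (3, 5)}             # (x, y)
    T = {(4, 6), (5, 7), (5, 8)}     # (y, z)

    a_vals = {x for (w, x) in R if w == 1}        # resolve R(1,a): {2, 3, 4}
    b_vals = {y for (x, y) in S if x in a_vals}   # resolve S(a,b): {4, 5}
    c_vals = {z for (y, z) in T if y in b_vals}   # resolve T(b,c): {6, 7, 8}
    print(sorted(c_vals))                          # [6, 7, 8]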
  • Slide 438
  • 21.5.4 Incorporating Union Views at the Mediator This implementation of the chain algorithm does not consider that several sources can contribute tuples to the same relation. If specific sources have tuples that other sources may not have, this adds complexity. To resolve it, we can either consult all sources, or make best efforts to return all the answers.
  • Slide 439
  • 21.5.4 Incorporating Union Views at the Mediator (cont.) Consulting all sources: a subgoal is resolved only when every source for its relation has an adornment matched by the current adornment of the subgoal. This is less practical, because it makes queries harder to answer and impossible if any source is down. Best efforts: we need only one source with a matching adornment to resolve a subgoal; this requires modifying the chain algorithm to revisit each subgoal when that subgoal acquires new bound arguments.
  • Slide 440
  • Local-as-View Mediators Priya Gangaraju (Class Id: 203)
  • Slide 441
  • Local-as-View Mediators In a LAV mediator, the global predicates are not defined as views of the source data. Instead, for each source, expressions are defined over the global predicates that describe the tuples the source is able to produce, and the mediator answers queries by constructing solutions from the views the sources provide.
  • Slide 442
  • Motivation for LAV Mediators The relationship between the data provided by the mediator and the sources is often more subtle. For example, consider the predicate Par(c, p), meaning that p is a parent of c, which represents the set of all child-parent facts that could ever exist; the sources will provide information about whatever child-parent facts they know.
  • Slide 443
  • Motivation (cont.) There can be sources that provide child-grandparent facts but no child-parent facts at all. Such a source could never be used to answer a child-parent query under a GAV mediator. LAV mediators allow us to say that a certain source provides grandparent facts, and to discover how and when to use that source in a given query.
  • Slide 444
  • Terminology for LAV Mediation The queries at the mediator and the queries describing the sources are single Datalog rules; a single Datalog rule is called a conjunctive query. The global predicates of the LAV mediator are used as the subgoals of mediator queries. Conjunctive queries also define the views: their heads each have a unique view predicate, which is the name of a view; each view definition has a body consisting of global predicates and is associated with a particular source. Each view is constructed with an all-free adornment.
  • Slide 445
  • Example: Consider the global predicate Par(c, p), meaning that p is a parent of c. One source produces parent facts; its view is defined by the conjunctive query:

        V1(c, p) ← Par(c, p)

    Another source produces grandparent facts; its conjunctive query is:

        V2(c, g) ← Par(c, p) AND Par(p, g)
  • Slide 446
  • Example (cont.): The query at the mediator asks for great-grandparent facts to be obtained from the sources:

        Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)

    One solution uses the parent view V1 directly three times:

        Q(w, z) ← V1(w, x) AND V1(x, y) AND V1(y, z)

    Other solutions combine V1 (parent facts) with V2 (grandparent facts):

        Q(w, z) ← V1(w, x) AND V2(x, z)   or   Q(w, z) ← V2(w, y) AND V1(y, z)
  • Slide 447
  • Expanding Solutions. Consider a query Q and a solution S whose body consists of subgoals that are views, each view V being defined by a conjunctive query with V as its head. The body of V's conjunctive query can be substituted for any subgoal of S that uses the predicate V, so that the body of S comes to consist only of global predicates.
  • Slide 448
  • Expansion Algorithm A solution S has a subgoal V(a1, a2, ..., an), where the ai's can be any variables or constants. The view V is of the form:
    V(b1, b2, ..., bn) ← B
    where B represents the entire body. V(a1, a2, ..., an) can be replaced in solution S by a version of body B that has all the subgoals of B, with variables possibly altered.
  • Slide 449
  • Expansion Algorithm (contd.) The rules for altering the variables of B are given below; a code sketch of the procedure follows this list. 1. First identify the local variables of B, the variables that appear in the body but not in the head. 2. If any local variable of B appears in S, replace it by a distinct new variable that appears nowhere in the rule for V or in S. 3. In the body B, replace each bi by ai, for i = 1, 2, ..., n.
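  • Below is a minimal Python sketch of these three rules. The representation of rules as (predicate, argument-list) pairs is an illustrative choice, and for simplicity every argument is treated as a variable:

```python
def expand(subgoal_args, head_vars, body, avoid):
    """Expand one view subgoal: body is the view's body as a list of
    (predicate, args) pairs; avoid holds the variable names used in S."""
    # Rule 1: local variables appear in the body B but not in the head.
    body_vars = {v for _, args in body for v in args}
    local_vars = body_vars - set(head_vars)
    # Rule 2: rename a clashing local variable to a fresh name that
    # appears neither in the rule for V nor in S.
    rename, counter = {}, 0
    for v in sorted(local_vars):
        if v in avoid:
            while f"v{counter}" in avoid or f"v{counter}" in body_vars:
                counter += 1
            rename[v] = f"v{counter}"
            counter += 1
        else:
            rename[v] = v
    # Rule 3: replace each head variable b_i by the subgoal argument a_i.
    rename.update(dict(zip(head_vars, subgoal_args)))
    return [(pred, [rename.get(v, v) for v in args]) for pred, args in body]

# Expanding V2(x, z) as in the example on the next slides:
body = [("Par", ["c", "p"]), ("Par", ["p", "g"])]
print(expand(["x", "z"], ["c", "g"], body, avoid={"w", "x", "z"}))
# -> [('Par', ['x', 'p']), ('Par', ['p', 'z'])]
```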
  • Slide 450
  • Example. Consider the view definitions:
    V1(c, p) ← Par(c, p)
    V2(c, g) ← Par(c, p) AND Par(p, g)
    One of the proposed solutions S is:
    Q(w, z) ← V1(w, x) AND V2(x, z)
    The first subgoal, with predicate V1, can be expanded to Par(w, x), as V1 has no local variables.
  • Slide 451
  • Example (contd.) The V2 subgoal has a local variable p, which appears neither in S nor as a local variable in another substitution, so p can be left as it is. Only x and z need be substituted for the variables c and g. The solution S now becomes:
    Q(w, z) ← Par(w, x) AND Par(x, p) AND Par(p, z)
  • Slide 452
  • Containment of Conjunctive Queries A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that: 1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E. 2. If P(x1, x2, ..., xn) is a subgoal of Q, then P(τ(x1), τ(x2), ..., τ(xn)) is a subgoal of E. We add to these rules the convention that τ(c) = c for any constant c.
  • Slide 453
  • Example. Consider two conjunctive queries:
    Q1: H(x, y) ← A(x, z) AND B(z, y)
    Q2: H(a, b) ← A(a, c) AND B(d, b) AND A(a, d)
    When we apply the substitution τ(x) = a, τ(y) = b, τ(z) = d, the head of Q1 becomes H(a, b), which is the head of Q2. So there is a containment mapping from Q1 to Q2.
  • Slide 454
  • Example (contd.) The first subgoal of Q1 becomes A(a, d), which is the third subgoal of Q2, and the second subgoal of Q1 becomes the second subgoal of Q2. There is also a containment mapping from Q2 to Q1, so the two conjunctive queries are equivalent.
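  • Since a containment mapping is just a variable-to-term function satisfying the two conditions above, a brute-force search over all candidate functions suffices for small queries. Below is a minimal Python sketch; the query encoding is an illustrative choice, and all arguments are treated as variables:

```python
from itertools import product

def containment_mapping(q, e):
    """Search exhaustively for a containment mapping tau from Q to E.
    A query is (head_args, body), where body lists (pred, args) pairs."""
    q_head, q_body = q
    e_head, e_body = e
    q_vars = sorted({v for _, args in q_body for v in args} | set(q_head))
    e_terms = sorted({v for _, args in e_body for v in args} | set(e_head))
    e_subgoals = {(p, tuple(args)) for p, args in e_body}
    for images in product(e_terms, repeat=len(q_vars)):
        tau = dict(zip(q_vars, images))
        heads_match = tuple(tau[v] for v in q_head) == tuple(e_head)
        if heads_match and all(
                (p, tuple(tau[v] for v in args)) in e_subgoals
                for p, args in q_body):
            return tau
    return None

Q1 = (("x", "y"), [("A", ("x", "z")), ("B", ("z", "y"))])
Q2 = (("a", "b"), [("A", ("a", "c")), ("B", ("d", "b")), ("A", ("a", "d"))])
print(containment_mapping(Q1, Q2))   # {'x': 'a', 'y': 'b', 'z': 'd'}
print(containment_mapping(Q2, Q1))   # also found, so Q1 and Q2 are equivalent
```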
  • Slide 455
  • Why the Containment-Mapping Test Works Suppose there is a containment mapping τ from Q1 to Q2. When Q2 is applied to the database, we look for a substitution σ for all the variables of Q2; the substitution for the head becomes a tuple t that is returned by Q2. If we compose τ and then σ, we have a mapping from the variables of Q1 to the database that produces the same tuple t for the head of Q1.
  • Slide 456
  • Finding Solutions to a Mediator Query There can be an infinite number of solutions built from the views, using any number of subgoals and variables. The LMSS Theorem limits the search: if a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals. Moreover, if the conjunctive query that defines a view V has in its body a predicate P that does not appear in the body of the mediator query, then we need not consider any solution that uses V.
  • Slide 457
  • Example. Recall the query Q1:
    Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
    This query has three subgoals, so we need not look at solutions with more than three subgoals.
  • Slide 458
  • Why the LMSS Theorem Holds Suppose we have a query Q with n subgoals, and there is a solution S with more than n subgoals. The expansion E of S must be contained in the query Q, which means that there is a containment mapping from Q to E. We remove from S all subgoals whose expansion was not the target of one of Q's subgoals under the containment mapping.
  • Slide 459
  • (contd.) We would then have a new conjunctive query S′ with at most n subgoals. If E′ is the expansion of S′, then E′ is contained in Q. Also, S is contained in S′, since there is an identity mapping from S′ to S. Thus S need not be among the solutions to query Q.
  • Slide 460
  • Information Integration Entity Resolution 21.7 Presented By: Deepti Bhardwaj Roll No: 223_103
  • Slide 461
  • Contents 21.7 Entity Resolution 21.7.1 Deciding Whether Records Represent a Common Entity 21.7.2 Merging Similar Records 21.7.3 Useful Properties of Similarity and Merge Functions 21.7.4 The R-Swoosh Algorithm for ICAR Records 21.7.5 Other Approaches to Entity Resolution
  • Slide 462
  • Introduction ENTITY RESOLUTION: Entity resolution is a problem that arises in many information integration scenarios. It refers to determining whether two records or tuples do or do not represent the same person, organization, place or other entity.
  • Slide 463
  • Deciding whether Records Represent a Common Entity Two records represent the same individual if they have similar values for each of the fields associated with those records. It is not sufficient to require that the values of corresponding fields be identical, for the following reasons: 1. Misspellings 2. Variant Names 3. Misunderstanding of Names
  • Slide 464
  • Continue: Deciding whether Records Represent a Common Entity 4. Evolution of Values 5. Abbreviations Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.
  • Slide 465
  • Deciding Whether Records Represent a Common Entity - Edit Distance The first approach to measuring the similarity of records is edit distance. Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other. The records represent the same entity if their edit distance is below a given threshold.
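  • A minimal sketch of this insertion/deletion edit distance (no substitutions) as a standard dynamic program; the example strings are illustrative:

```python
def edit_distance(s, t):
    """Number of single-character insertions and deletions
    needed to turn string s into string t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # characters match
            else:
                d[i][j] = 1 + min(d[i - 1][j],     # delete s[i-1]
                                  d[i][j - 1])     # insert t[j-1]
    return d[m][n]

print(edit_distance("Susan", "Suzan"))   # 2: delete 's', insert 'z'
```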
  • Slide 466
  • Deciding Whether Records Represent a Common Entity - Normalization A second approach is to normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. Once records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
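  • A small sketch of this idea; the abbreviation table is purely illustrative, and the normalized fields would then be compared with an edit-distance function like the one above:

```python
# Illustrative abbreviation table; a real system would use a larger one.
ABBREVIATIONS = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}

def normalize(value):
    """Expand known abbreviations, then fold case and whitespace."""
    for abbr, full in ABBREVIATIONS.items():
        value = value.replace(abbr, full)
    return " ".join(value.lower().split())

print(normalize("456 Maple  St."))   # '456 maple street'
```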
  • Slide 467
  • Merging Similar Records Merging refers to removing the redundancy in two records that represent the same entity. There are many possible merge rules, for example: 1. Set any field on which the records disagree to the empty string. 2. (i) Merge by taking the union of the values in each field; (ii) declare two records similar if at least two of the three fields have a nonempty intersection.
  • Slide 468
  • Continue: Merging Similar Records

       Name  | Address       | Phone
    1. Susan | 123 Oak St.   | 818-555-1234
    2. Susan | 456 Maple St. | 818-555-1234
    3. Susan | 456 Maple St. | 213-555-5678

    After merging:

            Name  | Address                      | Phone
    (1-2-3) Susan | {123 Oak St., 456 Maple St.} | {818-555-1234, 213-555-5678}
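  • The merger above follows rule 2(i): take the union of the values in each field. A minimal Python sketch, representing each field as a set of values:

```python
# Merge rule 2(i): union the values of each field.
def merge(r, s):
    return {field: r[field] | s[field] for field in r}

r1 = {"name": {"Susan"}, "addr": {"123 Oak St."},   "phone": {"818-555-1234"}}
r2 = {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"818-555-1234"}}
r3 = {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"213-555-5678"}}

m = merge(merge(r1, r2), r3)
print(m["addr"])    # {'123 Oak St.', '456 Maple St.'}
print(m["phone"])   # {'818-555-1234', '213-555-5678'}
```

    Note that set union is idempotent, commutative, and associative, which is exactly what the properties on the next slide require of a merge function.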
  • Slide 469
  • Useful Properties of Similarity and Merge Functions The following properties say that the merge operation is a semilattice: 1. Idempotence: the merge of a record with itself yields the same record. 2. Commutativity: the order in which two records are merged does not matter. 3. Associativity: the order in which we group records for a merger does not matter.
  • Slide 470
  • Continue: Useful Properties of Similarity and Merge Functions There are some other properties that we expect the similarity relationship to have: Idempotence for similarity: a record is always similar to itself. Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them. Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
  • Slide 471
  • R-Swoosh Algorithm for ICAR Records
    INPUT: a set of records I, a similarity function, and a merge function.
    OUTPUT: a set of merged records O.
    METHOD:
      O := ∅;
      WHILE I is not empty DO BEGIN
        let r be any record in I;
        find, if possible, some record s in O that is similar to r;
        IF no such record s exists THEN
          move r from I to O
        ELSE BEGIN
          delete r from I;
          delete s from O;
          add the merger of r and s to I;
        END;
      END;
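  • Below is a direct Python rendering of this pseudocode. The record representation and the similarity test (rule 2(ii): at least two of the three fields share a value) are illustrative choices, not fixed by the algorithm:

```python
def similar(r, s, threshold=2):
    """Rule 2(ii): similar if enough fields have a nonempty intersection."""
    return sum(bool(r[f] & s[f]) for f in r) >= threshold

def merge(r, s):
    """Rule 2(i): union the values of each field."""
    return {f: r[f] | s[f] for f in r}

def r_swoosh(I):
    O = []
    while I:
        r = I.pop()                                    # any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)                                # move r from I to O
        else:
            O.remove(s)                                # merge r and s, and
            I.append(merge(r, s))                      # put the merger back in I
    return O

records = [
    {"name": {"Susan"}, "addr": {"123 Oak St."},   "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"213-555-5678"}},
]
print(len(r_swoosh(records)))   # 1: the three Susan records merge into one
```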
  • Slide 472
  • Other Approaches to Entity Resolution The other approaches to entity resolution are: Non-ICAR Datasets, Clustering, and Partitioning.
  • Slide 473
  • Other Approaches to Entity Resolution - Non-ICAR Datasets Non-ICAR Datasets: we can define a dominance relation r ≤ s, meaning that record s contains all the information contained in record r. If so, the dominated record r can be eliminated from further consideration.