CS222/122C – Fall 2019, Final Exam with Solutions

NAME: _______________________ UCI ID: _______________________

Principles of Data Management

Department of Computer Science, UC Irvine
Prof. Chen Li

(Total Points: 120)

Instructions:

● IMPORTANT: Fill in your full name and UCI ID (eight digits) on each page.

● IMPORTANT: Do NOT write your answers on the back of the pages, as only one side will be scanned.

● This exam contains eight (8) questions, total 120 points.

● This exam is closed book. However, you can use one cheat sheet (A4 size).

● The total time for the exam is 120 minutes, so budget your time accordingly.

● Be sure to explain your answer for each question and show your work.

● If you don’t understand something, ask for clarification.

● If you still find ambiguities in a question, write down the interpretation you are taking and then answer the question based on that interpretation.

QUESTION   TOPIC                POINTS
1          Short Questions      18
2          Cost Estimation      12
3          Record Manager       14
4          B+ Tree              16
5          Linear Hashing       12
6          External Sort        10
7          Join                 24
8          Query Optimization   14
TOTAL                           120

Question 1 (18 points): Short questions

1.1 (3 pts) When we analyze the cost of a database operation, e.g., the time of inserting a record into a heap file, why do we care about the type of each disk I/O (random versus sequential)?

Seek time is the dominant component of the cost of a disk I/O. With sequential I/O, we seek only once and thus pay that cost only once. With random I/O, we have to seek every time we access a new page. Thus, a sequential I/O is much cheaper than a random I/O.

1.2 (3 pts) Explain the meaning of pin/unpin operations in a buffer manager.

PIN: when a process needs to use a frame in the buffer, it pins the frame to tell the manager, which increments the frame's pin count by 1. After the process is done using the data, it calls UNPIN to tell the manager, which decrements the count by 1. The manager can replace the page only if the count is 0, meaning no process is accessing this page.

1.3 (6 pts) Consider a table orders (orderId INTEGER, productid INTEGER, customerId INTEGER, time TIMESTAMP, storeId INTEGER, price DOUBLE) with information about which customers bought which products at which stores at what time. Assume the table is large, with 300 million records in it. Consider the following two queries:

Q1: SELECT * FROM orders WHERE orderId = 12423;
Q2: SELECT SUM(price) FROM orders GROUP BY storeId;

1.3.1 (2 pts) What is the type (OLAP or OLTP) of each query?

Q1: OLTP
Q2: OLAP

1.3.2 (4 pts) What store format (row store or column store) would better fit each query’s purpose? Briefly explain your choice.

Q1: Row store
Explanation: OLTP queries such as Q1 tend to access a small number of records with many columns.

Q2: Column store
Explanation: OLAP queries such as Q2 tend to access a few columns of many records, and a column store can be I/O efficient by using compression.

1.4 (3 pts) In Project 2, an extra-credit requirement is to implement the following functions without touching the existing records in the table.

RC dropAttribute(…);
RC addAttribute(…);

Briefly describe a way to implement them (even if you didn’t implement it).

Schema versioning or special tag: maintain the schema version of columns and tables in the system catalogs and keep the version in the record itself. When fetching a record, reconcile the current schema version with the version stored in the record to return correct information.
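The schema-versioning idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the catalog layout `SCHEMAS`, the attribute names, and the helper `read_record` are hypothetical and are not part of the project API.

```python
# Hypothetical catalog: schema version -> ordered attribute names.
SCHEMAS = {
    1: ["sid", "name", "email"],
    2: ["sid", "name", "email", "phone"],  # after addAttribute("phone")
    3: ["sid", "email", "phone"],          # after dropAttribute("name")
}
CURRENT_VERSION = 3


def read_record(stored):
    """Translate a stored record to the current schema.

    `stored` is (version, values), where `values` follows the schema that was
    current when the record was written. Stored records are never rewritten.
    """
    version, values = stored
    old = dict(zip(SCHEMAS[version], values))
    # Dropped attributes are skipped; attributes added later read as NULL (None).
    return {attr: old.get(attr) for attr in SCHEMAS[CURRENT_VERSION]}


old_record = (1, [33, "Ann", "ann@example.org"])
print(read_record(old_record))
```

A record written under version 1 keeps its original bytes; only the read path changes. Matching attributes by name is a simplification: if an attribute is dropped and a new one with the same name is added later, a real catalog would need attribute ids to tell them apart.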

1.5 (3 pts) Describe three ways to manage the free space of pages within a record file.

1) Doubly linked list. Maintain a linked list of page descriptors for free pages, and also a linked list of page descriptors for full pages.
2) Bitmap. Keep a bitmap that uses one bit for each page (for example, 0 means full, 1 means the page has free space).
3) Directory. Keep a directory to store space info and page pointers.

An answer that stores the free space on each page and does a sequential scan is also accepted.

Question 2 (12 points): Cost Estimation

Consider a table with the following schema:

car_sales (make VARCHAR(20), model VARCHAR(30), price DOUBLE).

It has 300 records with the following distribution of 'price' values.

2.1 (3 pts) Draw a 5-bucket equi-width histogram.

2.2 (3 pts) Consider the following query:

SELECT * FROM car_sales WHERE price <= X;

What value of X makes the estimated number of records equal to 98, based on the equi-width histogram in 2.1? You can leave your answer as a formula.

35 + 34 + (X-12)/(18-12)*58 = 98
X = 15

2.3 (3 pts) Draw a 4-bucket equi-height histogram.
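The linear interpolation behind the 2.2 answer can be sketched in Python. The counts 35, 34, and 58 and the bucket [12, 18] come from the solution formula; the boundaries of the first two buckets are an assumption (equal width 6, starting at 0) and do not affect the result. The function name `solve_x` is hypothetical.

```python
def solve_x(buckets, target):
    """Find X such that the estimated count of `price <= X` equals `target`.

    `buckets` is a list of (low, high, count) in ascending order. Values are
    assumed to be uniformly distributed inside each bucket.
    """
    running = 0
    for low, high, count in buckets:
        if count > 0 and running + count >= target:
            fraction = (target - running) / count
            return low + fraction * (high - low)
        running += count
    raise ValueError("target exceeds the total number of records")


# First two boundaries are assumed; the third bucket is taken from the solution.
buckets = [(0, 6, 35), (6, 12, 34), (12, 18, 58)]
x = solve_x(buckets, 98)
print(x)
```

The first two buckets contribute 35 + 34 = 69 records in full, so the remaining 29 of the 58 records in [12, 18] cover half of that bucket's width.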

2.4 (3 pts) Consider the following query:

SELECT * FROM car_sales WHERE price BETWEEN 18 AND 28;

What is the estimated number of results for this query based on the equi-height histogram in 2.3? You can leave your answer as a formula.

(20-18)/(20-14) * 75 + 75 + (28-24)/(30-24) * 75 = 150

Question 3 (14 points): Record Manager

Consider a table with the following schema:

students (sid INTEGER, name VARCHAR(10), email VARCHAR(20)).

We store it as a heap file of variable-length records.

3.1 (5 pts) For the following record, fill in the provided diagram to explain how it is stored as a sequence of bytes using a directory of pointers. Explain how the 'NULL' value is represented.

Assumptions:

● Use 2 bytes for field offsets of ending positions;
● We know the record schema, no need to store the number of fields;
● An “INTEGER” value takes 4 bytes.

students (33, NULL, 'anteater@uci.edu').

Byte(s)   Content
0-1       9 (ending offset of sid)
2-3       -9 or -1 (ending offset of name; a negative value marks NULL)
4-5       25 (ending offset of email)
6-9       33 (sid)
10-25     a n t e a t e r @ u c i . e d u (email, one character per byte)
26-39     unused

1. We asked for a directory of pointers. Using null indicators to manage NULL is not accepted.
2. For the second offset, both -9 and -1 are accepted. It cannot be 0 or 9, since that encoding would not support a VARCHAR of length 0.
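The layout in 3.1 can be reproduced with a short Python sketch. The function names, the little-endian byte order, and the choice of -1 (rather than -9) as the NULL marker are assumptions made for illustration; the email value is read off the byte diagram in the solution.

```python
import struct


def encode_record(sid, name, email):
    """Encode a students record as [3 x 2-byte ending offsets][field bytes].

    Each directory entry holds the position of the LAST byte of its field.
    A NULL field stores -1 and contributes no bytes to the record body.
    """
    fields = [
        None if sid is None else struct.pack("<i", sid),     # 4-byte INTEGER
        None if name is None else name.encode("ascii"),
        None if email is None else email.encode("ascii"),
    ]
    end = 2 * len(fields) - 1  # last byte of the directory (byte 5)
    offsets, body = [], b""
    for data in fields:
        if data is None:
            offsets.append(-1)
        else:
            end += len(data)
            offsets.append(end)
            body += data
    return struct.pack("<3h", *offsets) + body


def decode_offsets(record):
    """Return the three ending offsets stored at the front of the record."""
    return list(struct.unpack_from("<3h", record, 0))


record = encode_record(33, None, "anteater@uci.edu")
print(decode_offsets(record), len(record))
```

With this encoding, an empty string would repeat the previous ending offset (for example 9), which is why 0 or 9 cannot double as the NULL marker.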

3.2 (6 pts) Suppose we have a page designed as follows:

● Page size = 4,096 bytes;
● Each record offset takes 2 bytes;
● Each record length takes 2 bytes;
● The size of free space takes 4 bytes;
● Number of slots takes 4 bytes;
● Slots are inserted from right to left.

Start with an empty page. Fill in the missing values in the following diagram after all the following operations are done, including the number of free-space bytes, the number of slots, and each record’s name (“R1”, “R2”, or “R3”), offset, and length.

1. Insert record R1, length 80 bytes;
2. Insert record R2, length 60 bytes;
3. Insert record R3, length 125 bytes;
4. Delete record R2; assume the page is compacted immediately.

**record name R1 is marked for you as an example.**

Briefly explain your calculation:

Free Space = 4096 - 80 - 125 - (2+2) * 3 - 4 - 4 = 3871

The second slot’s offset could be -1, or a number bigger than 4096. Its length could be an arbitrary number.

3.3 (3 pts) Briefly describe what information tombstones contain and how tombstones are used in the record manager in our projects.

A tombstone contains a page number and a slot number (or offset). It points to a slot in another page. When updating a record, if the new record is larger than the original and has to move to another page, we leave a tombstone at the original location to point to the new location.
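A toy Python sketch of tombstone forwarding is given below. The classes `RID`, `Tombstone`, and `RecordStore` are hypothetical stand-ins for illustration and do not mirror the project's actual record manager interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RID:
    page: int
    slot: int


@dataclass
class Tombstone:
    target: RID  # page number and slot number of the record's new location


class RecordStore:
    """Toy store mapping each RID to record bytes or to a Tombstone."""

    def __init__(self):
        self.slots = {}

    def insert(self, rid, data):
        self.slots[rid] = data

    def move(self, rid, new_rid, data):
        """Relocate an updated record and leave a tombstone at the old RID."""
        self.slots[new_rid] = data
        self.slots[rid] = Tombstone(new_rid)

    def read(self, rid):
        entry = self.slots[rid]
        if isinstance(entry, Tombstone):
            return self.slots[entry.target]  # follow the forwarding pointer
        return entry


store = RecordStore()
store.insert(RID(1, 2), b"short")
store.move(RID(1, 2), RID(7, 0), b"a much longer value")
print(store.read(RID(1, 2)))
```

Because the original RID keeps working, indexes that store RIDs do not need to be updated when a record moves.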

Question 4 (16 points): B+ Tree

Consider the following unclustered B+ tree index of order d = 2 on the “price” field of a relation tickets (ticketId INTEGER, price INTEGER).

4.1 (4 pts) Assuming different records are in different disk pages, calculate the number of disk I/Os if we use this index to answer the following query.

SELECT * FROM tickets WHERE price ≥ 43 AND price ≤ 80;

Index Search: root, A, F = 3 Reads
Index Scan: G, H, I = 3 Reads
Record Fetch: 49, 51, 55, 62, 68 = 5 Reads
Total: 11 Reads

4.2 (12 pts) For the following questions, draw the updated B+ tree after each operation. If there is no ambiguity, you can just draw the part that is changed. For each operation, clearly add an “R” (for read) or “W” (for write, including append) on all the affected pages, as well as the original Root/A/B/C/.../J page labels.

Assume:

1) Each node can hold up to 4 key entries.
2) For delete operations, the right sibling is checked for possible redistribution.
3) For insert operations, you need to do split, not rotation.

** Original tree copied here for your convenience. **

4.2.1 (4 pts) Insert a data entry with key 85 on the original tree.

4.2.2 (4 pts) Insert a data entry with key 17 on the original tree.

4.2.3 (4 pts) Delete the data entry with key 20 from the original tree.

Question 5 (12 points): Linear Hashing

Consider a linear hashing index shown in the following figure.

Assume:

1) h0(key) = last two digits of “key”, h1(key) = last three digits of “key”.
2) A bucket split occurs whenever an overflow page is created.

5.1 (4 pts) Insert entry 22 on the original index. Draw the updated index.

5.2 (4 pts) Insert entry 63 on the original index. Draw the updated index.

5.3 (4 pts, TRICKY!) What is the minimum number of entry insertions on the original index that will cause a split of all four buckets? What is the value of “Next” after making these insertions? Draw an example final index and explain your answer briefly.

Minimum number: 7

Next is pointing to the first bucket again.

Each time, insert into a bucket that has the fewest available slots, to trigger a split sooner. Choose keys that will not be redistributed by the split, so that the bucket remains nearly full after the split.

Example:
insert 67 to trigger a split on bucket 1.
insert 33, 41 to trigger a split on bucket 2.
insert 49 to trigger a split on bucket 3.
insert 24, 32, 40 to trigger a split on bucket 4.

Final index:

Question 6 (10 points): External Sort

Suppose we have N = 27,900 pages of fixed-length records in a heap file. We have B = 31 available pages in memory to sort the file using the external sort algorithm covered in the lectures.

6.1 (3 pts) For each pass (including pass 0 of generating the initial runs), write down the number of sorted runs and the size of a single run.

pass 0: 27900/31 = 900 runs with 31 pages each.
pass 1: 900/(31-1) = 30 runs with 30 * 31 = 930 pages each.
pass 2: 30/30 = 1 run with 30 * 30 * 31 = 27900 pages. Merge finished.
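The per-pass numbers above can be reproduced with a small Python sketch. The function name `external_sort_passes` is hypothetical; the sketch assumes pass 0 fills all B buffer pages per run and every later pass merges B-1 runs at a time.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def external_sort_passes(n_pages, n_buffers):
    """Return a list of (pass_number, number_of_runs, pages_per_full_run)."""
    runs = ceil_div(n_pages, n_buffers)   # pass 0: sort B pages at a time
    run_size = n_buffers
    result = [(0, runs, run_size)]
    fan_in = n_buffers - 1                # one buffer page is kept for output
    pass_no = 0
    while runs > 1:
        pass_no += 1
        runs = ceil_div(runs, fan_in)
        run_size = min(run_size * fan_in, n_pages)
        result.append((pass_no, runs, run_size))
    return result


for pass_no, runs, size in external_sort_passes(27900, 31):
    print(f"pass {pass_no}: {runs} runs of up to {size} pages")
```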

6.2 (3 pts) Suppose the number of initial runs is X. Write down a formula to calculate the number of I/Os required to sort the entire file, excluding the writes in the last pass.

(2 * ceiling(log_{B-1}(X)) + 1) * N
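As a check on this formula, here is a short Python sketch. The function names are hypothetical. The number of merge passes is computed by repeated ceiling division instead of a floating-point logarithm, because a float log of an exact power (such as 900 in base 30) can round just above an integer and inflate the ceiling.

```python
def merge_passes(initial_runs, n_buffers):
    """Equals ceiling(log_{B-1}(X)) for X initial runs and B buffer pages."""
    fan_in = n_buffers - 1
    passes, runs = 0, initial_runs
    while runs > 1:
        runs = -(-runs // fan_in)  # ceiling division
        passes += 1
    return passes


def sort_io_excluding_final_writes(n_pages, initial_runs, n_buffers):
    """Pass 0 plus each merge pass reads and writes N pages: 2N per pass.

    Dropping the N writes of the last pass leaves (2 * merge_passes + 1) * N.
    """
    return (2 * merge_passes(initial_runs, n_buffers) + 1) * n_pages


print(sort_io_excluding_final_writes(27900, 900, 31))
```

For the file in this question (N = 27,900, X = 900, B = 31) there are two merge passes, so the sketch evaluates (2 * 2 + 1) * 27,900.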

6.3 (4 pts) When generating the initial runs in pass 0, we implicitly assumed that the records are fixed length, so that we can do an in-place swap of two records in memory. However, for variable-length records, we cannot do an in-place swap since different records have different lengths. Describe a method to sort variable-length records in memory efficiently.

Allocate some space in memory to store a directory of pointers, one pointing to each record. Then, when sorting the variable-length records, swap the pointers instead of the actual records. Even a small directory needs at least one buffer page. If there are too many records for one buffer page to hold all the pointers, use more buffer pages as needed.
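The pointer-directory idea can be sketched in Python as follows. The representation of a pointer as an (offset, length) pair and the function name `sort_by_pointers` are assumptions for illustration; the record bytes themselves are never moved.

```python
def sort_by_pointers(buffer, directory, key=lambda rec: rec):
    """Sort variable-length records by reordering pointers only.

    `buffer` holds the records back to back; `directory` is a list of
    (offset, length) pairs, one per record. Returns a new, sorted directory.
    """
    def record_at(ptr):
        offset, length = ptr
        return buffer[offset:offset + length]

    return sorted(directory, key=lambda ptr: key(record_at(ptr)))


buffer = b"pear" + b"fig" + b"banana"
directory = [(0, 4), (4, 3), (7, 6)]

sorted_dir = sort_by_pointers(buffer, directory)
for offset, length in sorted_dir:
    print(buffer[offset:offset + length])
```

Writing the run out then means walking the sorted directory and copying each record to the output page in that order.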

Question 7 (24 points): Join

Suppose we have the following two tables.

● customers (cid, name, age, ...)
  ○ cid is the primary key;
  ○ 4,000 records (total), 300 pages;
  ○ An unclustered B+ tree on 'cid';

● orders (date, cid, price, ...)
  ○ 25,000 records (total), 500 pages;
  ○ 'cid' is a foreign key to 'customers.cid';
  ○ An unclustered B+ tree on 'cid';
  ○ The 'cid' values are uniformly distributed.

We want to join the tables on their 'cid' attributes, i.e., customers.cid = orders.cid.

7.1 (4 pts) Estimate the I/O cost of the Index Nested Loop Join, by scanning the “customers” heap file as the outer table and using the B+ tree of 'orders.cid'. Assume all non-leaf pages of the B+ tree are cached in memory.

300 + 4000 * (1 + ceiling(25000/4000)) = 300 + 4000 * (1 + 7) = 32,300

That is, 300 I/Os to scan customers, and for each of the 4,000 customers, 1 leaf-page read plus ceiling(25000/4000) = 7 record fetches through the unclustered index.
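The index nested loop estimate can be written as a small Python function. The name `inlj_cost` and its parameters are hypothetical; the sketch assumes one leaf-page read per probe (non-leaf pages cached) and one I/O per matching inner record, since the index is unclustered.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def inlj_cost(outer_pages, outer_records, inner_records, leaf_reads_per_probe=1):
    """Estimated I/O cost of an index nested loop join.

    Scan the outer table once; for each outer record, read the index leaf and
    then fetch each matching inner record with a separate I/O.
    """
    matches_per_probe = ceil_div(inner_records, outer_records)  # uniform cid values
    per_probe = leaf_reads_per_probe + matches_per_probe
    return outer_pages + outer_records * per_probe


print(inlj_cost(outer_pages=300, outer_records=4000, inner_records=25000))
```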

7.2 (4 pts) Estimate the I/O cost of the Block Nested Loop Join, by scanning the “customers” heap file as the outer table. Assume the block size B = 30 pages.

300 + 300/30 * 500 = 5300

7.3 (6 pts) Consider the case where we want to use Grace Hash Join. Assume we have B = 16 in-memory buffer pages. We treat the smaller table as the outer table to build the in-memory hashtable. Calculate the number of disk I/Os in the partitioning/building phase and the number of disk I/Os in the probing phase, excluding the final writes.

Partitioning/building phase:

Two partitioning passes are needed, as B-1 = 15 and 15 * 15 < 300 < 500.

For customers:

pass 0: 15 partitions with 20 pages each. Each page is read and written once.
pass 1: each partition is read back and repartitioned into 15 sub-partitions; each sub-partition will be ceiling(20/15) = 2 pages. Total 300 page reads and 15 * 15 * 2 = 450 writes.

For orders:

pass 0: 15 partitions with ceiling(500/15) = 34 pages each. Each page is read and written once.
pass 1: each partition is read back and repartitioned into 15 sub-partitions; each sub-partition will be ceiling(34/15) = 3 pages. Total 500 page reads and 15 * 15 * 3 = 675 writes.

Probing phase:

Read the corresponding partitions of customers and orders back. Total 450 + 675 reads.

Total: 300 * 3 + 450 * 2 + 500 * 3 + 675 * 2 = 4650
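The totals above can be checked with a short Python sketch that follows the same counting convention as the solution: pass 0 reads and writes each page once, pass 1 reads each page once and writes rounded-up sub-partitions, and probing reads every sub-partition back. The function name `grace_two_pass_io` is hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def grace_two_pass_io(pages, n_buffers):
    """Return (partitioning_io, probing_io) for one table with two passes."""
    fan_out = n_buffers - 1                       # 15 output partitions per pass
    partition_pages = ceil_div(pages, fan_out)    # pages per pass-0 partition
    sub_pages = ceil_div(partition_pages, fan_out)
    pass1_writes = fan_out * fan_out * sub_pages  # 225 sub-partitions, rounded up
    # pass 0 read + pass 0 write + pass 1 read, then the pass 1 writes
    partitioning_io = 3 * pages + pass1_writes
    probing_io = pass1_writes                     # each sub-partition is read once
    return partitioning_io, probing_io


customers = grace_two_pass_io(300, 16)
orders = grace_two_pass_io(500, 16)
total = sum(customers) + sum(orders)
print(customers, orders, total)
```

Under these assumptions the partitioning phase costs 1,350 I/Os for customers and 2,175 for orders, and the probing phase adds 450 + 675 reads.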

7.4 (5 pts) Draw a diagram to briefly explain the main idea of Simple Hash Join.

Suppose R is the smaller relation.

In step 1, read the records of R page by page. For each record, apply a hash function h. If the hash value h(x) is 0, keep the record in memory to build a hash table. Otherwise, pass it to the disk through a buffer page.

In step 2, read the records of S page by page. For each record, apply the same hash function h. If the hash value h(x) is 0, do a lookup in the in-memory hash table and find matching records, which are output as results through a one-page buffer. Otherwise, pass the record to the disk through a one-page buffer.

Repeat step 1 and step 2 for the passed-over records on the disk using a sequence of different hash functions, until the remaining disk records of R can fit into memory. Then load these pages into memory to build a hash table, and scan the disk pages of S to do the in-memory join.

7.5 (5 pts) Draw a diagram to briefly explain the main idea of Hybrid Hash Join. In addition, explain how this algorithm combines the ideas from the grace hash join and simple hash join.

Suppose R is the smaller relation.

In step 1, read the records of R page by page. For each record, apply a hash function h. If the hash value h(x) is 0, keep the record in memory to build a hash table for partition R0. Otherwise, pass it to the buffer page i = h(x), which will eventually be flushed to the disk to generate partition Ri.

In step 2, read the records of S page by page. For each record, apply the same hash function h. If the hash value h(x) is 0, do a lookup in the in-memory hash table of R0 and find matching records, which are output as results through a one-page buffer. Otherwise, pass the record to the buffer page i = h(x), which will eventually be flushed to the disk to generate partition Si.

Use the second step of grace join to join each (Ri, Si) partition pair.

This join algorithm combines the idea of “keeping one partition in memory to reduce disk I/Os” from Simple hash join and the idea of “partitioning the remaining records using multiple output buffers” from Grace hash join.

Question 8 (14 points): Query Optimization

8.1 (4 pts) Give two reasons why the System-R query optimizer only considers left-deep join plans during optimization.

(1) A left-deep plan has a better chance of allowing the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
(2) It reduces the search space.

Consider the following relations about movies, actors, studios, and their relationships, together with available B+ tree indexes:

● Movies (mid INTEGER, title VARCHAR(30), year INTEGER)
  ○ Clustered index on 'mid' (meaning “movie id”);
  ○ Unclustered index on 'year';

● Actors (aid INTEGER, name VARCHAR(30), gender INTEGER)
  ○ Clustered index on 'aid' (meaning “actor id”);
  ○ Unclustered index on 'name';

● ActorInMovies (aid INTEGER, mid INTEGER)
  ○ Clustered index on 'mid';
  ○ Unclustered index on 'aid';
  ○ Foreign key on 'aid' references Actors(aid);
  ○ Foreign key on 'mid' references Movies(mid);

● Studios (sid INTEGER, name VARCHAR(30), address VARCHAR(50))
  ○ Clustered index on 'sid' (meaning “studio id”);
  ○ Unclustered index on 'name';

● MovieByStudios (mid INTEGER, sid INTEGER)
  ○ Clustered index on 'mid';
  ○ Unclustered index on 'sid';
  ○ Foreign key on 'mid' references Movies(mid);
  ○ Foreign key on 'sid' references Studios(sid);

The meanings of the tables and attributes are self-explanatory. Consider the following query:

SELECT aid, SUM(M.revenue)
FROM Movies M, ActorInMovies AM, Actors A, MovieByStudios MS, Studios S
WHERE M.mid=AM.mid AND AM.aid=A.aid AND M.mid=MS.mid AND MS.sid=S.sid AND M.year=2019 AND S.name='Disney'
GROUP BY A.aid

We want to use the techniques in the System-R optimizer to generate an efficient physical plan.

8.2 (6 pts) For each of the following tables, write down all the access methods considered by the optimizer. Specify which of them will be considered for the next phase and explain why. Write down all the available interesting orders for each relation.

● Movies

1) B+ tree scan on mid, interesting order on mid; kept;
2) B+ tree search on year for the condition “year = 2019”; kept if its cost is less than that of access methods 1) and 4);
3) B+ tree search on mid for a constant; kept for a later index-based join;
4) Full scan on table; kept if its cost is less than that of 1) and 3).

● Actors

1) B+ tree scan on aid, interesting order on aid; kept;
2) B+ tree search on aid for a constant; kept for a later index-based join.

8.3 (4 pts) Explain how the optimizer generates efficient access methods for joining the tables Movies and ActorInMovies. No need to show all the enumerations.

1) Movies join ActorInMovies: For each access method on Movies kept from the previous step, and for each access method on ActorInMovies kept from the previous step, consider all possible valid join methods: block nested loop join, index-based join, sort-merge join, hash join, etc. Estimate the cost of each subplan. Use the interesting order from the previous method, if any, when estimating the cost of a sort-merge join. For each interesting order, select the join method with the smallest cost, and remove those subplans that are dominated by another subplan in terms of both cost and interesting order(s).

2) Repeat the same step for ActorInMovies join Movies.
